The Entropy Concept in Probability Theory


(Uspekhi Matematicheskikh Nauk, vol. VIII, no. 3, 1953, pp. 3-20) 


In his article “On the Drawing of Maps” P.L. Chebyshev
beautifully expresses the nature of the relation between scientific
theory and practice (discussing the case of mathematics):
“The bringing together of theory and practice leads to the
most favorable results; not only does practice benefit, but the
sciences themselves develop under the influence of practice, which
reveals new subjects for investigation and new aspects of
familiar subjects.” A striking example of the phenomenon
described by Chebyshev is afforded by the concept of entropy 
in probability theory, a concept which has evolved in recent 
years from the needs of practice. This concept first arose in 
attempting to create a theoretical model for the transmission 
of information of various kinds. In the beginning the concept 
was introduced in intimate association with transmission ap- 
paratus of one kind or another; its general theoretical signi- 
ficance and properties, and the general nature of its application 
to practice were only gradually realized. As of the present, a 
unified exposition of the theory of entropy can be found only 
in specialized articles and monographs dealing with the trans- 
mission of information. Although the study of entropy has 
actually evolved into an important and interesting chapter of 
the general theory of probability, a presentation of it in this 
general theoretical setting has so far been lacking. 
This article represents a first attempt at such a presentation.
In writing it, I relied mainly on Shannon’s paper “The Mathematical
Theory of Communication”.* However, Shannon’s treatment is not
always sufficiently complete and mathematically correct, so that
besides having to free the theory from practical details, in many
instances I have amplified and changed both the statement of
definitions and the statement and proofs of theorems. There is no
doubt that in the years to come the study of entropy will become a
permanent part of probability theory; the work I have done seems to
me to be a necessary stage in the development of this study.


§1. Entropy of Finite Schemes

In probability theory a complete system of events $A_1, A_2,
\ldots, A_n$ means a set of events such that one and only one of
them must occur at each trial (e.g., the appearance of 1, 2, 3,
4, 5, or 6 points in throwing a die). In the case $n = 2$ we have
a simple alternative or pair of mutually exclusive events (e.g.,
the appearance of heads or tails in tossing a coin). If we are
given the events $A_1, A_2, \ldots, A_n$ of a complete system, together
with their probabilities $p_1, p_2, \ldots, p_n$ ($p_k \ge 0$,
$\sum_{k=1}^{n} p_k = 1$), then we say that we have a finite scheme

$$A = \begin{pmatrix} A_1 & A_2 & \cdots & A_n \\ p_1 & p_2 & \cdots & p_n \end{pmatrix}. \tag{1}$$


In the case of a “true” die, designating the appearance of $i$
points by $A_i$ ($1 \le i \le 6$), we have the finite scheme

$$\begin{pmatrix} A_1 & A_2 & A_3 & A_4 & A_5 & A_6 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \end{pmatrix}.$$


* C.E. Shannon, Bell System Technical Journal, 27, 379-423, 623-656 (1948).




Every finite scheme describes a state of uncertainty. We
have an experiment, the outcome of which must be one of the
events $A_1, A_2, \ldots, A_n$, and we know only the probabilities of
these possible outcomes. It seems obvious that the amount of
uncertainty is different in different schemes. Thus, in the two
simple alternatives

$$\begin{pmatrix} A_1 & A_2 \\ 0.5 & 0.5 \end{pmatrix}, \qquad \begin{pmatrix} A_1 & A_2 \\ 0.99 & 0.01 \end{pmatrix},$$

the first obviously represents much more uncertainty than the
second; in the second case, the result of the experiment is
“almost surely” $A_1$, while in the first case we naturally refrain
from making any predictions. The scheme

$$\begin{pmatrix} A_1 & A_2 \\ 0.3 & 0.7 \end{pmatrix}$$

represents an amount of uncertainty intermediate between the
preceding two, etc.

For many applications it seems desirable to introduce a 
quantity which in a reasonable way measures the amount of 
uncertainty associated with a given finite scheme. We shall
see that the quantity

$$H(p_1, p_2, \ldots, p_n) = -\sum_{k=1}^{n} p_k \lg p_k$$

can serve as a very suitable measure of the uncertainty of the
finite scheme (1); the logarithms are taken to an arbitrary but
fixed base, and we always take $p_k \lg p_k = 0$ if $p_k = 0$. We shall
call the quantity $H(p_1, p_2, \ldots, p_n)$ the entropy of the finite
scheme (1), pursuing a physical analogy which there is no need
to go into here. We now convince ourselves that this function
actually has a number of properties which we might expect
of a reasonable measure of uncertainty of a finite scheme.
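
The definition just given can be tried out numerically on the three
simple alternatives considered earlier. The following is only a
minimal sketch: the function name `entropy`, the choice of base 2 for
the logarithms, and the input check are assumptions made for
illustration and are not part of the original text.

```python
import math


def entropy(probs, base=2.0):
    """Entropy H = -sum p_k lg p_k, with the convention 0 * lg 0 = 0."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # Terms with p_k = 0 are skipped, which implements p_k lg p_k = 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)


h_even = entropy([0.5, 0.5])      # the first simple alternative
h_skewed = entropy([0.99, 0.01])  # the second simple alternative
h_middle = entropy([0.3, 0.7])    # the intermediate scheme
print(h_even, h_middle, h_skewed)
```

With these inputs the three values come out in the order suggested by
the discussion above: the scheme (0.5, 0.5) has the largest entropy,
(0.99, 0.01) the smallest, and (0.3, 0.7) lies between them.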

First of all, we see immediately that $H(p_1, p_2, \ldots, p_n) = 0$ if
and only if one of the numbers $p_1, p_2, \ldots, p_n$ is one and all the
others are zero. But this is just the case where the result of 
the experiment can be predicted beforehand with complete 
certainty, so that there is no uncertainty as to its outcome. 
In all other cases the entropy is positive. 

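Furthermore, for fixed $n$ it is obvious that the scheme with
the most uncertainty is the one with equally likely outcomes,
i.e., $p_k = 1/n$ ($k = 1, 2, \ldots, n$), and in fact the entropy assumes its
largest value for just these values of the variables $p_k$. The
easiest way to see this is to use an inequality which is valid
for any continuous convex function $\varphi(x)$:

$$\varphi\left(\frac{1}{n}\sum_{k=1}^{n} a_k\right) \le \frac{1}{n}\sum_{k=1}^{n} \varphi(a_k),$$

where $a_1, a_2, \ldots, a_n$ are any positive numbers. Setting $a_k = p_k$
and $\varphi(x) = x \lg x$, and bearing in mind that $\sum_{k=1}^{n} p_k = 1$, we find

$$\varphi\left(\frac{1}{n}\right) = \frac{1}{n} \lg \frac{1}{n} \le \frac{1}{n}\sum_{k=1}^{n} p_k \lg p_k = -\frac{1}{n} H(p_1, p_2, \ldots, p_n),$$

whence

$$H(p_1, p_2, \ldots, p_n) \le \lg n = H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right), \quad \text{Q.E.D.}$$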
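
The bound $H \le \lg n$ can also be checked empirically. The sketch
below draws random schemes with $n = 6$ outcomes and compares the
largest entropy found with $\lg n$; the helper `random_scheme`, the
seed, and the number of trials are arbitrary assumptions, and a random
search of this kind illustrates the inequality without proving it.

```python
import math
import random


def entropy(probs):
    """Entropy in base 2, with the convention 0 * lg 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def random_scheme(n, rng):
    """Return n non-negative numbers summing to 1 (a hypothetical helper)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
n = 6
bound = math.log2(n)                 # lg n, the claimed maximum
largest = max(entropy(random_scheme(n, rng)) for _ in range(1000))
uniform = entropy([1.0 / n] * n)     # entropy of the "true" die
print(largest, uniform, bound)
```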


Suppose now we have two finite schemes

$$A = \begin{pmatrix} A_1 & A_2 & \cdots & A_n \\ p_1 & p_2 & \cdots & p_n \end{pmatrix}, \qquad B = \begin{pmatrix} B_1 & B_2 & \cdots & B_m \\ q_1 & q_2 & \cdots & q_m \end{pmatrix},$$

and let these two schemes be (mutually) independent, i.e., the
probability $\pi_{kl}$ of the joint occurrence of the events $A_k$ and
$B_l$ is $p_k q_l$. Then, the set of events $A_k B_l$ ($1 \le k \le n$, $1 \le l \le m$),
with probabilities $\pi_{kl}$, represents another finite scheme, which
we call the product of the schemes A and B and designate by
AB. Let $H(A)$, $H(B)$, and $H(AB)$ be the corresponding entropies
of the schemes A, B, and AB. Then

$$H(AB) = H(A) + H(B), \tag{2}$$

for, in fact

$$-H(AB) = \sum_k \sum_l \pi_{kl} \lg \pi_{kl} = \sum_k \sum_l p_k q_l (\lg p_k + \lg q_l) = \sum_k p_k \lg p_k \sum_l q_l + \sum_l q_l \lg q_l \sum_k p_k = -H(A) - H(B).$$
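
Relation (2) is easy to confirm on a small example. In the sketch
below the probabilities of the two independent schemes are arbitrary
illustrative values, and the names `p`, `q`, and `product` are
assumptions introduced only for this example.

```python
import math


def entropy(probs):
    """Entropy in base 2, with the convention 0 * lg 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


p = [0.5, 0.3, 0.2]   # probabilities p_k of scheme A
q = [0.9, 0.1]        # probabilities q_l of scheme B

# Independence: the event A_k B_l has probability pi_kl = p_k * q_l.
product = [pk * ql for pk in p for ql in q]

h_ab = entropy(product)
h_sum = entropy(p) + entropy(q)
print(h_ab, h_sum)
```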
We now turn to the case where the schemes A and B are
(mutually) dependent. We denote by $q_{kl}$ the probability that
the event $B_l$ of the scheme B occurs, given that the event $A_k$
of the scheme A occurred, so that

$$\pi_{kl} = p_k q_{kl} \quad (1 \le k \le n, \ 1 \le l \le m).$$

Then

$$-H(AB) = \sum_k \sum_l p_k q_{kl} (\lg p_k + \lg q_{kl}) = \sum_k p_k \lg p_k \sum_l q_{kl} + \sum_k p_k \sum_l q_{kl} \lg q_{kl}.$$

Here $\sum_l q_{kl} = 1$ for any $k$, and the sum $-\sum_l q_{kl} \lg q_{kl}$ can be
regarded as the conditional entropy $H_k(B)$ of the scheme B,
calculated on the assumption that the event $A_k$ of the scheme
A occurred. We obtain

$$H(AB) = H(A) + \sum_k p_k H_k(B).$$

The conditional entropy $H_k(B)$ is obviously a random variable
in the scheme A; its value is completely determined by the
knowledge of which event $A_k$ of the scheme A actually occurred.
Therefore, the last term of the right side is the mathematical
expectation of the quantity $H_k(B)$ in the scheme A, which we
shall designate by $H_A(B)$. Thus in the most general case, we
have

$$H(AB) = H(A) + H_A(B). \tag{3}$$

It is self-evident that the relation (3) reduces to (2) in the
special case where the schemes A and B are independent.
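
Relation (3) can be illustrated in the same way for a pair of dependent
schemes. In the sketch below the conditional probabilities $q_{kl}$
are stored row by row in `q_cond`; all numerical values and variable
names are assumptions chosen only for illustration.

```python
import math


def entropy(probs):
    """Entropy in base 2, with the convention 0 * lg 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


p = [0.6, 0.4]                 # probabilities p_k of A_1, A_2
q_cond = [[0.9, 0.1, 0.0],     # q_1l: distribution of B given A_1
          [0.2, 0.3, 0.5]]     # q_2l: distribution of B given A_2

# Joint probabilities pi_kl = p_k * q_kl of the product scheme AB.
joint = [p[k] * q_cond[k][l] for k in range(2) for l in range(3)]

# H_A(B) is the expectation over A of the conditional entropies H_k(B).
h_a_of_b = sum(p[k] * entropy(q_cond[k]) for k in range(2))

h_ab = entropy(joint)
print(h_ab, entropy(p) + h_a_of_b)
```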

It is also interesting to note that in all cases $H_A(B) \le H(B)$.
It is reasonable to interpret this inequality as saying that, on
the average, knowledge of the outcome of the scheme A can
only decrease the uncertainty of the scheme B. To prove this,
we observe that any continuous convex function $f(x)$ obeys
the inequality*

$$\sum_k \lambda_k f(x_k) \ge f\left(\sum_k \lambda_k x_k\right),$$

if $\lambda_k \ge 0$ and $\sum_k \lambda_k = 1$. Therefore, setting $f(x) = x \lg x$,
$\lambda_k = p_k$, $x_k = q_{kl}$, we find for arbitrary $l$ that

$$\sum_k p_k q_{kl} \lg q_{kl} \ge \left(\sum_k p_k q_{kl}\right) \lg \left(\sum_k p_k q_{kl}\right) = q_l \lg q_l,$$

since obviously $\sum_k p_k q_{kl} = q_l$. Summing over $l$, we obtain on the
left side the quantity

$$\sum_k p_k \sum_l q_{kl} \lg q_{kl} = -\sum_k p_k H_k(B) = -H_A(B),$$

and consequently we find

$$-H_A(B) \ge \sum_l q_l \lg q_l = -H(B), \quad \text{Q.E.D.}$$
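
The inequality $H_A(B) \le H(B)$ can be probed numerically by drawing
random dependent schemes. The sketch below records the smallest
observed value of $H(B) - H_A(B)$; the helper `random_scheme`, the
sizes of the schemes, and the seed are assumptions, and such a search
only illustrates the inequality.

```python
import math
import random


def entropy(probs):
    """Entropy in base 2, with the convention 0 * lg 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def random_scheme(n, rng):
    """Return n non-negative numbers summing to 1 (a hypothetical helper)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
smallest_gap = float("inf")
for _ in range(500):
    p = random_scheme(3, rng)                            # scheme A
    q_cond = [random_scheme(4, rng) for _ in range(3)]   # rows q_kl
    # Unconditional probabilities q_l = sum_k p_k q_kl of scheme B.
    q = [sum(p[k] * q_cond[k][l] for k in range(3)) for l in range(4)]
    h_b = entropy(q)
    h_a_of_b = sum(p[k] * entropy(q_cond[k]) for k in range(3))
    smallest_gap = min(smallest_gap, h_b - h_a_of_b)
print(smallest_gap)
```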

* See, for example, Hardy, Littlewood, and Pólya, Inequalities, Cambridge University
Press, 1934.

If we carry out an experiment the possible outcomes of which
are described by the given scheme A, then in doing so we
obtain some information (i.e., we find out which of the events
$A_k$ actually occurs), and the uncertainty of the scheme is
completely eliminated. Thus, we can say that the information 
given us by carrying out some experiment consists in removing 
the uncertainty which existed before the experiment. The 
larger this uncertainty, the larger we consider to be the amount 
of information obtained by removing it. Since we agreed to 
measure the uncertainty of a finite scheme A by its entropy 
H(A), it is natural to express the amount of information given 
by removing this uncertainty by an increasing function of the 
quantity H(A). The choice of this function means the choice 
of some unit for the quantity of information and is therefore 
fundamentally a matter of indifference. However, the proper- 
ties of entropy which we demonstrated above show that it is 
especially convenient to take this quantity of information pro- 
portional to the entropy. Indeed, consider two finite schemes 
A and B and their product AB. Realization of the scheme AB 
is obviously equivalent to realization of both of the schemes A 
and B. Therefore, if the two schemes A and B are independent, 
it is natural to require the information given by the realization 
of the scheme AB to be the sum of the two amounts of in- 
formation given by the realization of the schemes A and B; 
since in this case 


H(AB)=H(A)+H(B), 


this requirement will actually be met, if we consider the amount 
of information given by the realization of a finite scheme to 
be proportional to the entropy of the scheme. Of course, the 
constant of proportionality can be taken as unity, since this 
choice corresponds merely to a choice of units. Thus, in all 
that follows, we can consider the amount of information given 
by the realization of a finite scheme to be equal to the entropy
of the scheme. This stipulation makes the concept of entropy 
especially significant for information theory. 

In view of this stipulation, let us consider the case of two 
dependent schemes A and B and the corresponding relation (3). 
The amount of information given by the realization of the 
scheme AB is equal to H(AB). However, as explained above, 
in the general case, this cannot be equal to H(A)+H(B). 
Indeed, consider the extreme case where knowledge of the 
outcome of the scheme A also determines with certainty the 
outcome of the scheme B, so that each event $A_k$ of the scheme
A can occur only in conjunction with a specific event $B_l$ of
the scheme B. Then, after realization of the scheme A, the
scheme B completely loses its uncertainty, and we have $H_A(B) = 0$;
moreover, in this case realization of the scheme B obviously
gives no further information, and we have $H(AB) = H(A)$, so
that relation (3) is indeed satisfied. In all cases, the quantity
$H_k(B)$ introduced above is the amount of information given by
the scheme B, given that the event $A_k$ occurred in the scheme
A; therefore the quantity $H_A(B) = \sum_k p_k H_k(B)$ is the
mathematical expectation of the amount of additional information
given by realization of the scheme B after realization of
scheme A and reception of the corresponding information.
Therefore, the relation (3) has the following very reasonable
interpretation: The amount of information given by the realization
of the two finite schemes A and B equals the amount of
information given by the realization of scheme A, plus the
mathematical expectation of the amount of additional information
given by the realization of scheme B after the realization
of the scheme A. In just the same way we can give an
entirely reasonable interpretation of the general inequality
$H_A(B) \le H(B)$ proved above: The amount of information given
by the realization of a scheme B can only decrease if another
scheme A is realized beforehand.


§2. The Uniqueness Theorem


Among the properties of entropy which we have proved, 
we can consider the following two as basic: 


1. For given $n$ and for $\sum_{k=1}^{n} p_k = 1$, the function $H(p_1, p_2, \ldots, p_n)$
takes its largest value for $p_k = 1/n$ ($k = 1, 2, \ldots, n$).

2. $H(AB) = H(A) + H_A(B)$.

We add to these two properties a third, which obviously must
be satisfied by any reasonable definition of entropy. Since the
schemes

$$\begin{pmatrix} A_1 & A_2 & \cdots & A_n \\ p_1 & p_2 & \cdots & p_n \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} A_1 & A_2 & \cdots & A_n & A_{n+1} \\ p_1 & p_2 & \cdots & p_n & 0 \end{pmatrix}$$

are obviously not substantively different, we must have

3. $H(p_1, p_2, \ldots, p_n, 0) = H(p_1, p_2, \ldots, p_n)$. (Adding the
impossible event or any number of impossible events to a scheme
does not change its entropy.)

We now prove the following important proposition:

Theorem 1.

Let $H(p_1, p_2, \ldots, p_n)$ be a function defined for any integer
$n$ and for all values $p_1, p_2, \ldots, p_n$ such that $p_k \ge 0$ ($k = 1, 2, \ldots, n$),
$\sum_{k=1}^{n} p_k = 1$. If for any $n$ this function is continuous with respect
to all its arguments, and if it has the properties 1, 2, and 3,
then

$$H(p_1, p_2, \ldots, p_n) = -\lambda \sum_{k=1}^{n} p_k \lg p_k,$$

where $\lambda$ is a positive constant.

This theorem shows that the expression for the entropy of 
a finite scheme which we have chosen is the only one possible 
if we want it to have certain general properties which seem 
necessary in view of the actual meaning of the concept of 
entropy (as a measure of uncertainty or as an amount of
information).


Proof. 


For brevity we set

$$H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right) = L(n);$$

we shall show that $L(n) = \lambda \lg n$, where $\lambda$ is a positive constant.
By 3 and 1, we have

$$L(n) = H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}, 0\right) \le H\left(\frac{1}{n+1}, \frac{1}{n+1}, \ldots, \frac{1}{n+1}\right) = L(n+1),$$

so that $L(n)$ is a non-decreasing function of $n$. Let $m$ and $r$
be positive integers. Consider $m$ mutually independent finite
schemes $S_1, S_2, \ldots, S_m$, each of which contains $r$ equally likely
events, so that

$$H(S_k) = H\left(\frac{1}{r}, \frac{1}{r}, \ldots, \frac{1}{r}\right) = L(r) \quad (1 \le k \le m).$$

By Property 2 (generalized to the case of $m$ schemes) we have,
in view of the independence of the schemes $S_k$,

$$H(S_1 S_2 \cdots S_m) = \sum_{k=1}^{m} H(S_k) = m L(r).$$

But the product scheme $S_1 S_2 \cdots S_m$ obviously consists of $r^m$
equally likely events, so that its entropy is $L(r^m)$. Therefore
we have

$$L(r^m) = m L(r), \tag{4}$$

and similarly, for any other pair of positive integers $n$ and $s$

$$L(s^n) = n L(s). \tag{5}$$

Now let the numbers $r$, $s$, and $n$ be given arbitrarily, but
let the number $m$ be determined by the inequalities

$$r^m \le s^n < r^{m+1}, \tag{6}$$

whence

$$m \lg r \le n \lg s < (m+1) \lg r,$$

$$\frac{m}{n} \le \frac{\lg s}{\lg r} < \frac{m}{n} + \frac{1}{n}. \tag{7}$$

It follows from (6) by the monotonicity of the function $L(n)$
that

$$L(r^m) \le L(s^n) \le L(r^{m+1}),$$

and, consequently, by (4) and (5)

$$m L(r) \le n L(s) \le (m+1) L(r),$$

so that

$$\frac{m}{n} \le \frac{L(s)}{L(r)} \le \frac{m}{n} + \frac{1}{n}. \tag{8}$$

Finally, it follows from (7) and (8) that

$$\left| \frac{L(s)}{L(r)} - \frac{\lg s}{\lg r} \right| \le \frac{1}{n}.$$

Since the left side of this inequality is independent of $m$, and
since $n$ can be chosen arbitrarily large in the right side,

$$\frac{L(s)}{\lg s} = \frac{L(r)}{\lg r},$$

which, in view of the arbitrariness of $r$ and $s$, means that

$$L(n) = \lambda \lg n,$$

where $\lambda$ is a constant. By the monotonicity of the function
$L(n)$, we have $\lambda \ge 0$, and our assertion is proved.

This assertion represents the special case $p_k = 1/n$ ($1 \le k \le n$)
of the theorem to be proved. We now consider the more
general case, where the $p_k$ ($k = 1, 2, \ldots, n$) are any rational
numbers. Let

$$p_k = \frac{g_k}{g} \quad (k = 1, 2, \ldots, n),$$

where all the $g_k$ are positive integers and $\sum_{k=1}^{n} g_k = g$. Let the
finite scheme A consist of $n$ events with probabilities $p_1, p_2, \ldots, p_n$.
Our problem consists in defining the entropy of this scheme.
To this end, we consider a second scheme B, which is dependent
on A and is defined as follows: The scheme B contains $g$
events $B_1, B_2, \ldots, B_g$, which we divide into $n$ groups, containing
$g_1, g_2, \ldots, g_n$ events, respectively. If the event $A_k$ occurred
in scheme A, then in scheme B all the $g_k$ events of the $k$'th
group have the same probability $1/g_k$, and all the events of
the other groups have probability zero (are impossible). Thus,
given any outcome $A_k$ of the scheme A, the scheme B reduces
to a system of $g_k$ equally likely events, so that the conditional
entropy

$$H_k(B) = H\left(\frac{1}{g_k}, \frac{1}{g_k}, \ldots, \frac{1}{g_k}\right) = L(g_k) = \lambda \lg g_k,$$

which means that

$$H_A(B) = \sum_{k=1}^{n} p_k H_k(B) = \lambda \sum_{k=1}^{n} p_k \lg g_k = \lambda \sum_{k=1}^{n} p_k \lg p_k + \lambda \lg g. \tag{9}$$

We return now to the product scheme AB, consisting of the
events $A_k B_l$ ($1 \le k \le n$, $1 \le l \le g$). Such an event is possible
only if $B_l$ belongs to the $k$'th group. Thus, the number of
possible events $A_k B_l$ for a given $k$ is $g_k$, and the total number
of possible events in the scheme AB is $\sum_{k=1}^{n} g_k = g$. The probability
of each possible event $A_k B_l$ is obviously $p_k / g_k = 1/g$, i.e., is the
same for all the events. Thus, the scheme AB consists of $g$
equally likely events, and therefore

$$H(AB) = L(g) = \lambda \lg g.$$

Using Property 2 and relation (9), we find

$$\lambda \lg g = H(A) + \lambda \sum_{k=1}^{n} p_k \lg p_k + \lambda \lg g,$$

whence

$$H(A) = H(p_1, p_2, \ldots, p_n) = -\lambda \sum_{k=1}^{n} p_k \lg p_k. \tag{10}$$

Finally, relation (10), which we have proved for rational
$p_1, p_2, \ldots, p_n$, must be valid for any values of its arguments
because of the postulated continuity of the function $H(p_1, p_2,
\ldots, p_n)$. Thus the proof of Theorem 1 is complete.
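
The grouping construction used for rational probabilities can be
followed on a concrete case. The sketch below takes $\lambda = 1$,
base-2 logarithms, and the illustrative values $g_1, g_2, g_3 = 1, 2, 3$
(so $g = 6$); these choices and all variable names are assumptions
made only for this example.

```python
import math
from fractions import Fraction

g_parts = [1, 2, 3]                      # group sizes g_k
g = sum(g_parts)                         # total number of events in B
p = [Fraction(gk, g) for gk in g_parts]  # p_k = g_k / g

# The product scheme AB consists of g equally likely events: H(AB) = lg g.
h_ab = math.log2(g)

# Given A_k, scheme B has g_k equally likely events: H_k(B) = lg g_k,
# so H_A(B) = sum_k p_k lg g_k.
h_a_of_b = sum(float(pk) * math.log2(gk) for pk, gk in zip(p, g_parts))

# Property 2 then yields H(A) = H(AB) - H_A(B).
h_a = h_ab - h_a_of_b

# Direct evaluation of formula (10) with lambda = 1.
direct = -sum(float(pk) * math.log2(float(pk)) for pk in p)
print(h_a, direct)
```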

§3. Entropy of Markov Chains

Suppose we have a simple stationary Markov chain with
a finite number of states $A_1, A_2, \ldots, A_n$ and with the transition
probability matrix $p_{ik}$ ($i, k = 1, 2, \ldots, n$). We denote by $P_k$ the
probability of the state $A_k$ ($1 \le k \le n$), so that in particular

$$\sum_{i=1}^{n} P_i p_{il} = P_l \quad (l = 1, 2, \ldots, n). \tag{11}$$

If the system is in state $A_i$, then its transitions to the different
states $A_k$ ($k = 1, 2, \ldots, n$) form a finite scheme

$$\begin{pmatrix} A_1 & A_2 & \cdots & A_n \\ p_{i1} & p_{i2} & \cdots & p_{in} \end{pmatrix},$$

the entropy of which

$$H_i = -\sum_{k=1}^{n} p_{ik} \lg p_{ik}$$

depends on $i$ and can be regarded as a measure of the amount
of information obtained when the Markov chain moves one step
ahead, starting from the initial state $A_i$. The average of this
quantity over all initial states, i.e., the quantity

$$H = \sum_{i=1}^{n} P_i H_i = -\sum_{i=1}^{n} \sum_{k=1}^{n} P_i p_{ik} \lg p_{ik},$$

is therefore to be regarded as a measure of the average amount
of information obtained when the given Markov chain moves
one step ahead. This quantity $H$, which we shall call the
entropy of the chain in question, obviously characterizes the
chain as a whole; it is clear that it is uniquely determined by
giving the state probabilities $P_i$ and the transition probabilities
$p_{ik}$ ($1 \le i \le n$, $1 \le k \le n$).
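
As an illustration, the entropy of a small chain can be computed
directly from this definition. The sketch below is hedged in several
ways: the two-state transition matrix is an arbitrary example, the
state probabilities $P_i$ are obtained by simple repeated application
of relation (11) (which is assumed to converge, as it does for this
aperiodic example), and the function names are hypothetical.

```python
import math


def stationary(P, iterations=10_000):
    """Approximate the state probabilities P_i by iterating relation (11)."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * P[i][l] for i in range(n)) for l in range(n)]
    return dist


def chain_entropy(P):
    """H = -sum_i sum_k P_i p_ik lg p_ik (base 2), skipping p_ik = 0."""
    probs = stationary(P)
    return -sum(probs[i] * p * math.log2(p)
                for i, row in enumerate(P) for p in row if p > 0)


P = [[0.9, 0.1],    # transition probabilities p_1k
     [0.4, 0.6]]    # transition probabilities p_2k
state_probs = stationary(P)
H = chain_entropy(P)
print(state_probs, H)
```

For this matrix the state probabilities satisfying (11) are 0.8 and
0.2, as can be checked by substitution.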

All the concepts which are defined for moving one step ahead
can be easily and naturally generalized to the case of moving
ahead an arbitrary number of steps $r$. If the system is in
state $A_i$, then it is easy to calculate the probability that in
the next $r$ trials we shall find it in the states $A_{k_1}, A_{k_2}, \ldots, A_{k_r}$
in turn, where $k_1, k_2, \ldots, k_r$ are arbitrary numbers from 1 to $n$.
Thus, the subsequent fate of a system initially in the state $A_i$
in the next $r$ trials is described by a finite scheme (with $n^r$
events), with a definite entropy which we designate by $H_i^{(r)}$
and regard as a measure of the amount of information obtained
in moving ahead $r$ steps in the chain, starting from the initial
state $A_i$. The quantity

$$H^{(r)} = \sum_{i=1}^{n} P_i H_i^{(r)}$$

is to be regarded as the average amount of information given
by moving ahead $r$ steps in the given Markov chain. We shall
call it the $r$-step entropy of the chain in question. The one-step
entropy defined above can obviously be written as $H^{(1)}$.
If the notion of the quantity $H^{(r)}$ as the average amount of
information obtained in moving ahead $r$ steps in a given Markov
chain is to be reasonable, then it is natural to require that
for arbitrary positive integers $r$ and $s$ we have

$$H^{(r+s)} = H^{(r)} + H^{(s)},$$

or, equivalently, $H^{(r)} = r H^{(1)}$ for any positive integer $r$. It is
easily seen that this is actually the case. In the first place, the
relation $H^{(r)} = r H^{(1)}$ is trivially true for $r = 1$. Suppose now
that it is true for some $r \ge 1$; we shall show that in this case
$H^{(r+1)} = (r+1) H^{(1)}$.

Let the system be in the state $A_i$; the finite scheme which
describes the fate of the system in the next $r+1$ trials can
then be regarded as the product of two dependent schemes:

A) the scheme corresponding to the immediately following
trial, with the entropy $H_i^{(1)}$, and

B) the scheme describing the fate of the system in the
next $r$ trials; the entropy of this scheme is $H_k^{(r)}$, if the outcome
of scheme A was the event $A_k$. According to the general
relation

$$H(AB) = H(A) + H_A(B),$$

we have

$$H_i^{(r+1)} = H_i^{(1)} + \sum_{k=1}^{n} p_{ik} H_k^{(r)},$$

and consequently, in view of (11)

$$H^{(r+1)} = \sum_{i=1}^{n} P_i H_i^{(r+1)} = \sum_{i=1}^{n} P_i H_i^{(1)} + \sum_{k=1}^{n} H_k^{(r)} \sum_{i=1}^{n} P_i p_{ik} = H^{(1)} + \sum_{k=1}^{n} P_k H_k^{(r)} = H^{(1)} + H^{(r)} = (r+1) H^{(1)}, \quad \text{Q.E.D.}$$
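
The relation $H^{(r)} = r H^{(1)}$ can be verified by brute force on a
small chain, enumerating all $n^r$ continuations from each initial
state. The transition matrix and its state probabilities below are
illustrative assumptions (the values 0.8 and 0.2 satisfy (11) for this
matrix), and the function names are hypothetical.

```python
import itertools
import math

P = [[0.9, 0.1],
     [0.4, 0.6]]
STATE_PROBS = [0.8, 0.2]   # satisfies relation (11) for this P


def entropy(probs):
    """Entropy in base 2, with the convention 0 * lg 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def r_step_entropy(r):
    """H^(r): average over initial states A_i of the entropy H_i^(r)."""
    n = len(P)
    total = 0.0
    for i in range(n):
        path_probs = []
        # Each path k_1, ..., k_r is one event of the scheme with n^r events.
        for path in itertools.product(range(n), repeat=r):
            prob, current = 1.0, i
            for k in path:
                prob *= P[current][k]
                current = k
            path_probs.append(prob)
        total += STATE_PROBS[i] * entropy(path_probs)
    return total


h1 = r_step_entropy(1)
h4 = r_step_entropy(4)
print(h1, h4, 4 * h1)
```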


In a large number of cases, the application of the concept 
of the entropy of Markov chains in information theory is based 
on two fundamental theorems, which we shall prove in the
next section.


§4. Fundamental Theorems


Suppose now that the Markov chain we are studying obeys
the law of large numbers, i.e., that in a sufficiently long
sequence of $s$ consecutive trials the relative frequency $m_i/s$ of
occurrence of the state $A_i$ will differ from $P_i$ by an arbitrarily
small amount, with a probability arbitrarily close to unity. In
other words, for arbitrarily small $\varepsilon > 0$ and $\delta > 0$, and for
sufficiently large $s$

$$P\left\{ \left| \frac{m_i}{s} - P_i \right| > \delta \right\} < \varepsilon. \tag{12}$$

For brevity, we shall call such a chain ergodic.

*) A.A. Markov showed that for this to be the case it is sufficient to assume that the
chain in question is ‘transitive’, i.e., that a transition is possible from any state to any
other state in a sufficiently large number of steps. For the chains with which information
theory is concerned, this hypothesis is apparently always fulfilled.

Each possible result of the series of $s$ consecutive trials of
the given Markov chain can be written as a “sequence”

$$A_{k_1}, A_{k_2}, \ldots, A_{k_s} \tag{C}$$

where $k_1, k_2, \ldots, k_s$ are numbers from 1 to $n$. The probability
of realizing the sequence (C) does not depend on the part of
the chain where the series of trials begins (because of the
stationarity of the chain), and is equal to

$$p(C) = P_{k_1} p_{k_1 k_2} p_{k_2 k_3} \cdots p_{k_{s-1} k_s}.$$

Let $i$ and $l$ be two arbitrary numbers from 1 to $n$, and let $m_{il}$
be the number of pairs of the form $k_r k_{r+1}$ ($1 \le r \le s-1$) in which
$k_r = i$, $k_{r+1} = l$. Then, clearly the probability of the sequence
(C) can be written in the form

$$p(C) = P_{k_1} \prod_{i=1}^{n} \prod_{l=1}^{n} (p_{il})^{m_{il}}. \tag{13}$$

Theorem 2.

Given $\varepsilon > 0$ and $\eta > 0$, no matter how small, for sufficiently
large $s$ all sequences of the form (C) can be divided into two
groups with the following properties: 1) the probability $p(C)$ of
any sequence of the first group satisfies the inequality

$$\left| \frac{\lg \dfrac{1}{p(C)}}{s} - H \right| < \eta; \tag{14}$$

and 2) the sum of the probabilities of all sequences of the second
group is less than $\varepsilon$.

In other words, all sequences with the exception of a very
low probability group have probabilities lying between $a^{-s(H+\eta)}$
and $a^{-s(H-\eta)}$, where $a$ is the base of the system of logarithms
used. Here $H$ is the one-step entropy of the given chain.
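
The content of Theorem 2 can be observed in a simulation: for long
sampled sequences the quantity $\lg(1/p(C))/s$ should cluster near $H$.
The sketch below uses an illustrative two-state chain whose state
probabilities 0.8 and 0.2 satisfy (11); the sequence length, the
tolerance `eta`, the number of samples, and the seed are all arbitrary
assumptions, and a simulation of this kind is of course no substitute
for the proof that follows.

```python
import math
import random

P = [[0.9, 0.1],
     [0.4, 0.6]]
STATE_PROBS = [0.8, 0.2]   # satisfies relation (11) for this P

# One-step entropy H of the chain (base-2 logarithms).
H = -sum(STATE_PROBS[i] * p * math.log2(p)
         for i, row in enumerate(P) for p in row if p > 0)


def information_rate(s, rng):
    """Sample one s-term sequence (C) and return lg(1/p(C)) / s."""
    state = 0 if rng.random() < STATE_PROBS[0] else 1
    log_prob = math.log2(STATE_PROBS[state])
    for _ in range(s - 1):
        nxt = 0 if rng.random() < P[state][0] else 1
        log_prob += math.log2(P[state][nxt])
        state = nxt
    return -log_prob / s


rng = random.Random(0)
eta = 0.1
rates = [information_rate(5000, rng) for _ in range(100)]
# Share of sampled sequences that satisfy inequality (14).
share_inside = sum(abs(rate - H) < eta for rate in rates) / len(rates)
print(H, share_inside)
```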


Proof.

We shall agree to assign the sequence (C) to the first group
if it has the following two properties:

1. it is a possible outcome, i.e., $p(C) > 0$, and
2. for any $i$, $l$ ($1 \le i \le n$, $1 \le l \le n$) the inequality

$$| m_{il} - s P_i p_{il} | < s\delta \tag{15}$$

obtains. All the other sequences are assigned to the second
group. We shall show that this division satisfies both requirements
of Theorem 2, if $\delta > 0$ is sufficiently small and if $s$ is
sufficiently large.

1) Suppose the sequence (C) belongs to the first group. It
follows from (15) that

$$m_{il} = s P_i p_{il} + s \delta \theta_{il}, \quad |\theta_{il}| < 1, \quad 1 \le i \le n, \ 1 \le l \le n.$$

We now substitute these expressions for the numbers $m_{il}$ in
(13), where in doing so we must bear in mind that the requirement
that the given sequence be a possible outcome implies
that $m_{il} = 0$ when $p_{il} = 0$. Thus in the product (13) we must
restrict ourselves to factors such that $p_{il} > 0$, which we shall
denote by an asterisk on the product signs. We find

$$p(C) = P_{k_1} {\prod_i}^{*} {\prod_l}^{*} (p_{il})^{s P_i p_{il} + s \delta \theta_{il}},$$

$$\lg \frac{1}{p(C)} = -\lg P_{k_1} - s {\sum_i}^{*} {\sum_l}^{*} P_i p_{il} \lg p_{il} - s\delta {\sum_i}^{*} {\sum_l}^{*} \theta_{il} \lg p_{il} = -\lg P_{k_1} + sH - s\delta {\sum_i}^{*} {\sum_l}^{*} \theta_{il} \lg p_{il},$$

whence

$$\left| \frac{\lg \dfrac{1}{p(C)}}{s} - H \right| < \frac{1}{s} \lg \frac{1}{P_{k_1}} + \delta {\sum_i}^{*} {\sum_l}^{*} \lg \frac{1}{p_{il}}.$$

This means that for sufficiently large $s$

$$\left| \frac{\lg \dfrac{1}{p(C)}}{s} - H \right| < \eta,$$

where $\eta > 0$ is as small as we please if $\delta$ is sufficiently small.
Thus, the first requirement of Theorem 2 is satisfied.

2) Turning now to the calculation of the sum of the
probabilities of all the sequences of the second group, we note
first of all that the impossible sequences in this group are not
included in this calculation, since their probabilities are zero.
Thus, we only have to calculate the sum of the probabilities of
those sequences for which the inequality (15) does not hold for
at least one pair of indices $i$, $l$; to do this it suffices to calculate
the quantity

$$\sum_{i=1}^{n} \sum_{l=1}^{n} P\{ | m_{il} - s P_i p_{il} | \ge \delta s \}.$$

First we fix the indices $i$ and $l$. By (12) we have for sufficiently
large $s$

$$P\left\{ | m_i - s P_i | < \frac{\delta}{2} s \right\} > 1 - \varepsilon.$$

If the inequality in braces is satisfied, and if $s$ is sufficiently large,
then $m_i$ is as large as we please, and therefore by Bernoulli's
theorem

$$P\left\{ \left| \frac{m_{il}}{m_i} - p_{il} \right| < \frac{\delta}{2} \right\} > 1 - \varepsilon.$$

Therefore the probability of satisfying both of the inequalities

$$| m_i - s P_i | < \frac{\delta}{2} s, \tag{16}$$

$$\left| \frac{m_{il}}{m_i} - p_{il} \right| < \frac{\delta}{2} \tag{17}$$

exceeds $(1 - \varepsilon)^2 > 1 - 2\varepsilon$. But it follows from (16) that

$$| p_{il} m_i - s P_i p_{il} | < p_{il} \frac{\delta}{2} s \le \frac{\delta s}{2},$$

and this together with (17) (which, since $m_i \le s$, gives
$| m_{il} - p_{il} m_i | < \delta s / 2$) gives

$$| m_{il} - s P_i p_{il} | < \delta s. \tag{18}$$

Thus for any $i$ and $l$, we have for sufficiently large $s$

$$P\{ | m_{il} - s P_i p_{il} | < \delta s \} > 1 - 2\varepsilon,$$

which means that

$$P\{ | m_{il} - s P_i p_{il} | \ge \delta s \} < 2\varepsilon.$$

From this it follows that

$$\sum_{i=1}^{n} \sum_{l=1}^{n} P\{ | m_{il} - s P_i p_{il} | \ge \delta s \} < 2 n^2 \varepsilon.$$

Since the right side of this inequality is together with $\varepsilon$
arbitrarily small, the sum of the probabilities of all the sequences
of the second group can be made as small as we please for
sufficiently large $s$, i.e., the second requirement of Theorem
2 is also satisfied. Thus Theorem 2 is proved.

We now note that the number of all $s$-term sequences of the
form (C) equals $n^s$, and we arrange these sequences in order
of decreasing probability $p(C)$. We select sequences from this
series in the order in which we have arranged them until the
sum of the probabilities of the sequences selected just exceeds
a preassigned positive number $\lambda$ ($0 < \lambda < 1$). We denote by $N_s(\lambda)$
the number of sequences so selected. Theorem 2 permits us
to make the following important estimate of the number $N_s(\lambda)$.

Theorem 3.

$$\lim_{s \to \infty} \frac{\lg N_s(\lambda)}{s} = H.$$

In particular, the indicated limit does not depend on the
number $\lambda$, if only $0 < \lambda < 1$ and $\lambda$ remains constant as $s$
increases.
Proof.

We agree to call a sequence (C) standard if its probability
$p(C)$ satisfies the inequality (14), where $\eta$ is a fixed, arbitrarily
small positive number. By Theorem 2, the sum of the probabilities
of all non-standard sequences is arbitrarily small for
sufficiently large $s$. The inequality (14) is equivalent to the
inequalities

$$a^{-s(H+\eta)} < p(C) < a^{-s(H-\eta)}, \tag{19}$$

which therefore are characteristic of standard sequences ($a$
denotes the base of the system of logarithms used).

The following sequences are among the selected sequences
(with sum of probabilities $> \lambda$):

1) all non-standard sequences with probabilities $p(C) \ge
a^{-s(H-\eta)}$; the sum of the probabilities of such sequences does not
exceed the sum of the probabilities of all non-standard sequences,
which latter is for sufficiently large $s$ less than any
$\varepsilon > 0$, however small.

2) a certain number $M_s(\lambda)$ of standard sequences, the sum
of the probabilities of which must be greater than $\lambda - \varepsilon$ (since
the sum of the probabilities of the selected sequences exceeds $\lambda$).

The non-standard sequences with probabilities $p(C) \le a^{-s(H+\eta)}$
cannot be among the selected sequences, since according to
Theorem 2 the sum of the probabilities of the standard sequences
by themselves exceeds $\lambda$. Thus, for all the selected
sequences $p(C) > a^{-s(H+\eta)}$, and therefore the sum of the probabilities
of the selected sequences is greater than $N_s(\lambda) a^{-s(H+\eta)}$.
On the other hand, this sum is obviously less than $\lambda + q$, where
$q$ is the probability of the last sequence selected; thus

$$N_s(\lambda) a^{-s(H+\eta)} < \lambda + q < \lambda + a^{-s(H-\eta)},$$

which means (since $\lambda < 1$) that for sufficiently large $s$

$$N_s(\lambda) a^{-s(H+\eta)} < 1,$$

whence

$$\frac{\lg N_s(\lambda)}{s} < H + \eta. \tag{20}$$

On the other hand, the $M_s(\lambda)$ standard sequences which we
have selected have probabilities $p(C) < a^{-s(H-\eta)}$, while their sum
is larger than $\lambda - \varepsilon$. It follows that

$$M_s(\lambda) a^{-s(H-\eta)} > \lambda - \varepsilon,$$

so that a fortiori

$$N_s(\lambda) a^{-s(H-\eta)} > \lambda - \varepsilon,$$

whence

$$\frac{\lg N_s(\lambda)}{s} > H - \eta + \frac{1}{s} \lg (\lambda - \varepsilon). \tag{21}$$

Combining the inequalities (20) and (21) obviously proves
Theorem 3, since the number $\eta$ can be made arbitrarily small
for sufficiently large $s$.

The great value of Theorem 3 for a variety of applications
depends on the following considerations. Whereas the number
of sequences of type (C) is $n^s = a^{s \lg n}$, the number $N_s(\lambda)$ of
selected sequences is approximately $a^{sH}$, as Theorem 3 shows.
If we recall that $\lg n$ is the maximum value of the entropy
$H$ and that, consequently, we always have $H < \lg n$ (except for
a trivial case) and if we choose the number $\lambda$ very close to
unity, then Theorem 3 shows that a negligibly small fraction
of all the sequences (C) has a sum of probabilities arbitrarily
close to unity (for sufficiently large $s$). Moreover, we see that
the entropy of the given Markov chain plays a decisive role in
determining how small this fraction is.
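
For a very small chain the number $N_s(\lambda)$ can be found by direct
enumeration, which makes the selection procedure concrete. The sketch
below uses an illustrative two-state chain (state probabilities 0.8
and 0.2, which satisfy (11)), with the arbitrary assumptions $s = 12$
and $\lambda = 0.9$. At such a short length $\lg N_s(\lambda)/s$ need not yet
be close to $H$, so the example shows only how the count is formed and
that the selected sequences are a proper subset of all $n^s$ sequences.

```python
import itertools
import math

P = [[0.9, 0.1],
     [0.4, 0.6]]
STATE_PROBS = [0.8, 0.2]   # satisfies relation (11) for this P


def selected_count(s, lam):
    """N_s(lam): most probable sequences needed for total probability > lam."""
    n = len(P)
    probs = []
    for seq in itertools.product(range(n), repeat=s):
        prob = STATE_PROBS[seq[0]]
        for a, b in zip(seq, seq[1:]):
            prob *= P[a][b]
        probs.append(prob)
    probs.sort(reverse=True)      # order of decreasing probability p(C)
    total = 0.0
    for count, prob in enumerate(probs, start=1):
        total += prob
        if total > lam:
            return count
    return len(probs)


s, lam = 12, 0.9
count = selected_count(s, lam)
rate = math.log2(count) / s       # lg N_s(lam) / s in base 2
print(count, 2 ** s, rate)
```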
§5. Application to Coding Theory


In order to give at least one example which illustrates the 
practical applications of the entropy concept, we now consider 
one of the simplest problems of coding theory. Suppose that 
the text which is to be coded consists of a sequence of symbols 
(letters) belonging to a finite set (alphabet), and denote by m 
the number of different symbols (so that the number of different 
sequences of length $s$ is $m^s$). We shall regard this text as a
simple Markov chain of the type considered in the two preced- 
ing sections and assume that its statistical structure is known, 
in particular its entropy H. (We know that the maximum 
value of H is lg m.) We restrict ourselves to a consideration 
of the simplest case where the text at hand is coded into the 
same alphabet, i.e., each sequence from the text is coded into 
a sequence of letters from the same alphabet. (Actually, of 
course, it is only important that the coded text has the same 
number of symbols as the uncoded text, since no role is played 
by the designation of the symbols.) It goes without saying 
that the rules of coding must guarantee that the original text 
can be uniquely reconstructed from the coded text, which re- 
quires in particular that different sequences of the uncoded 
text must be coded differently. 

It is immediately clear that by using as short a coding 
as possible for the most commonly encountered sequences, and 
conversely, by leaving the longer coding for the more rarely 
encountered sequences, we have the possibility of making the 
coded text shorter than the original, which obviously might 
constitute a practical and economic advantage. In order to 
analyze this possibility, we must first of all choose quantities 
which can measure in a natural way this kind of compression
by coding. It is apparent at once that both the possible amount 
of compression and the choice of an optimum code to achieve 
it depend entirely on the statistical structure of the given 
text. 

Each s-term sequence (C) of symbols from the input text
has a definite probability p(C), and the sequence of coded text
into which it is transformed by the coding has a definite length
$\sigma(C)$. The ratio $\sigma(C)/s$ can be regarded as a “compression
coefficient” for the given s-term sequence. The mathematical
expectation of this ratio

$$\mu_s = \frac{\sum p(C)\,\sigma(C)}{s}$$

(where the summation is over all sequences of length s) is the
“average compression” for sequences of length s; finally, the
quantity

$$\mu = \lim_{s \to \infty} \mu_s ,$$

which we shall call the compression coefficient (of a given text
for a given code), can clearly serve to measure in a natural
way the compression of the given uncoded text by the given
means of coding. In addition, we note that in all cases actually
encountered the quantity $\mu_s$ approaches a definite limit as $s \to \infty$.
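
As an illustration of these definitions, the average compression $\mu_s$ can be computed directly in a small case. The following is only a sketch under assumptions that are not in the text: the symbols are taken to be independent (a degenerate Markov chain), and the two-letter alphabet, the probabilities `p`, and the block code `code` are all hypothetical. The code is chosen prefix-free so that the coded text can be uniquely decoded.

```python
from itertools import product
from math import prod

def average_compression(p, code, s):
    """mu_s: expected coded length of an s-term sequence, divided by s."""
    total = 0.0
    for letters in product(sorted(p), repeat=s):
        block = "".join(letters)
        prob = prod(p[x] for x in letters)   # independence is an assumption
        total += prob * len(code[block])     # p(C) * sigma(C)
    return total / s

# Hypothetical source and code (same alphabet {a, b} on both sides).
p = {"a": 0.9, "b": 0.1}
code = {"aa": "a", "ab": "ba", "ba": "bba", "bb": "bbb"}
mu_2 = average_compression(p, code, 2)
```

For these hypothetical numbers $\mu_2$ is less than 1, so the toy code shortens the text on the average.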

We are interested in the smallest value of the compression 
coefficient that can be achieved by coding when the text has 
a given statistical structure. Naturally, at the same time we 
are interested in how to construct the corresponding “optimum” 
code. A complete answer to these questions is contained in
the following remarkably simple theorem:

Theorem 4.

If the entropy of the given text is H, then the greatest
lower bound of the compression coefficient $\mu$ for all possible
codes is H/lg m, where m is the number of different symbols
of the text.

Thus, to find the greatest lower bound of the possible shorten- 
ing of a text by coding, there is no need to know in detail 
the statistical structure of the text; it is sufficient to consider 
only its entropy and the number of symbols it uses. Since lg m 
is the maximum value of H for the given number of symbols, 
the quantity H/lg m is sometimes called the “relative entropy” 
of the given text. 
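
The quantity H/lg m is easy to evaluate numerically. The sketch below assumes, purely for illustration, a text of independent symbols with a hypothetical four-letter distribution, so that H is the entropy of a single symbol; for a genuine Markov chain H would have to be computed from the transition probabilities instead.

```python
from math import log2

def entropy(probs):
    """H = -sum p lg p, logarithms to base 2."""
    return -sum(q * log2(q) for q in probs if q > 0)

def relative_entropy(probs):
    """H / lg m for an alphabet of m = len(probs) symbols."""
    return entropy(probs) / log2(len(probs))

# Hypothetical letter probabilities for m = 4 symbols.
letter_probs = [0.5, 0.25, 0.125, 0.125]
bound = relative_entropy(letter_probs)
```

Under these assumptions Theorem 4 says that no code can shorten such a text, on the average, to less than the fraction `bound` of its original length.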

Proof.
We must show that 1) for any code $\mu \ge \frac{H}{\lg m}$, and 2) for
arbitrarily small $\eta > 0$, there exists a code for which $\mu < \frac{H+\eta}{\lg m}$.

1) Choose an arbitrary code and let $H' = H - 2\eta$, where $\eta > 0$
is arbitrarily small. We agree to call an s-term sequence of
the given text a special sequence, if $\sigma(C) < H's/\lg m$. Since the
number of different k-term sequences of the coded text is $m^k$,
the number of all special s-term sequences of the text is no
greater than

$$m + m^2 + \cdots + m^{[H's/\lg m]} \le m^{H's/\lg m}\left\{1 + \frac{1}{m} + \frac{1}{m^2} + \cdots\right\} = \frac{m}{m-1}\,a^{sH'},$$

where a is the base of the system of logarithms used. We
shall denote the sum of the probabilities of all the special
sequences by $\lambda_s$, and show that $\lambda_s \to 0$ as $s \to \infty$.

In fact, let $\varepsilon > 0$ be an arbitrarily small constant; then, by
Theorem 3, to obtain a sum of probabilities equal to $\varepsilon$, we
must take

$$N_s(\varepsilon) > a^{s(H-\eta)} = a^{s(H'+\eta)}$$

of the most probable sequences. But, as we have seen, for
sufficiently large s, the number of all special sequences does
not exceed

$$\frac{m}{m-1}\,a^{sH'} < a^{s(H'+\eta)}.$$

Therefore, the sum $\lambda_s$ of these probabilities is less than $\varepsilon$ if s
is sufficiently large, i.e., $\lambda_s \to 0$ as $s \to \infty$, as was to be shown.


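Since for every non-special sequence (C)

$$\sigma(C) \ge \frac{sH'}{\lg m},$$

the mathematical expectation of $\sigma(C)$ is no less than $(1-\lambda_s)\frac{sH'}{\lg m}$,
and therefore

$$\mu_s \ge (1-\lambda_s)\frac{H'}{\lg m};$$

but since $\lambda_s \to 0$ as $s \to \infty$,

$$\mu = \lim_{s \to \infty} \mu_s \ge \frac{H'}{\lg m} = \frac{H-2\eta}{\lg m}.$$

Finally, since $\eta$ is arbitrarily small, we have

$$\mu \ge \frac{H}{\lg m},$$

which proves the first assertion of Theorem 4.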
2) Let $\eta > 0$ be arbitrarily small; to prove the second
assertion of Theorem 4, we determine directly a code for which
$\mu \le \frac{H+\eta}{\lg m}$. Let $\delta > 0$ be arbitrarily small. The number of
different sequences of length $\frac{s(H+\delta)}{\lg m}$ equals

$$m^{\frac{s(H+\delta)}{\lg m}} = a^{s(H+\delta)}.$$





In addition, let $\varepsilon > 0$ be arbitrarily small: by Theorem 3, the
number $N_s(1-\varepsilon)$ of most probable s-term sequences the sum
of which has probability greater than or equal to $1-\varepsilon$ is less
than $a^{s(H+\delta)}$, if s is sufficiently large. Therefore, all these
“high-probability” s-term sequences can be coded using sequences
of length $\frac{s(H+\delta)}{\lg m}$, since, as we see, there are enough of
the latter to do so. As regards the remaining “low-probability”
s-term sequences (with total probability less than $\varepsilon$), we simply
code each of them into itself. In order to guarantee uniqueness
of decoding, it is sufficient (for example) to put one of the
previously unused sequences of length $\frac{s(H+\delta)}{\lg m}$ (but always the
same one!) in front of each such s-term sequence of the coded
text. For a code chosen in this way, the length $\sigma(C)$ of a
coded s-term sequence will be either $\frac{s(H+\delta)}{\lg m}$ or $s + \frac{s(H+\delta)}{\lg m}$,
where the first eventuality occurs with a probability less than
or equal to unity and the second with a probability less than
or equal to $\varepsilon$. Therefore, the mathematical expectation of $\sigma(C)$
does not exceed

$$\frac{s(H+\delta)}{\lg m} + \varepsilon\left[s + \frac{s(H+\delta)}{\lg m}\right] = s\left[\frac{H+\delta}{\lg m}(1+\varepsilon) + \varepsilon\right] < s\,\frac{H+\eta}{\lg m}$$

if $\varepsilon$ and $\delta$ are chosen sufficiently small.
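
The construction just described can be tried out on a small example. The sketch below is not the code of the proof itself: it assumes independent binary symbols with hypothetical probabilities, and instead of the length $s(H+\delta)/\lg m$ supplied by Theorem 3 it simply takes the shortest length that leaves room for all high-probability blocks plus one spare marker word. Low-probability blocks are sent as the marker followed by the block itself.

```python
from itertools import product
from math import log2, prod

def block_code_ratio(p, s, eps):
    """Expected coded length per input symbol for the two-class block code."""
    letters = sorted(p)
    m = len(letters)
    blocks = [(prod(p[x] for x in b), b) for b in product(letters, repeat=s)]
    blocks.sort(key=lambda t: t[0], reverse=True)   # most probable first
    covered, n_high = 0.0, 0
    while covered < 1 - eps:                        # high-probability class
        covered += blocks[n_high][0]
        n_high += 1
    length = 1
    while m ** length < n_high + 1:                 # +1 keeps one unused marker
        length += 1
    # high-probability blocks cost `length`; the others cost `length + s`
    expected = length + s * (1 - covered)
    return expected / s

# Hypothetical source: independent symbols, p("1") = 0.1.
p = {"0": 0.9, "1": 0.1}
ratio = block_code_ratio(p, 12, 0.1)
H = -sum(q * log2(q) for q in p.values())
```

For these assumed values the ratio lies between H/lg m and 1; by Theorem 4 it can be brought as close to H/lg m as desired only by letting s grow and $\varepsilon$ shrink.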

Suppose now we have a large text (C) of length S=ks,
where k is very large. This text can be broken up into k
sequences $C_1, C_2, \cdots, C_k$ of length s; correspondingly, the coded
text of length $\sigma(C)$ falls into k sequences with corresponding
lengths $\sigma(C_1), \sigma(C_2), \cdots, \sigma(C_k)$, so that

$$\sigma(C) = \sigma(C_1) + \sigma(C_2) + \cdots + \sigma(C_k).$$

Therefore, the mathematical expectation of $\sigma(C)$ is k times the
mathematical expectation of the quantity $\sigma(C_1)$, and so by the
foregoing, does not exceed

$$ks\,\frac{H+\eta}{\lg m} = S\,\frac{H+\eta}{\lg m}.$$

It follows that

$$\mu_S \le \frac{H+\eta}{\lg m},$$

and consequently that

$$\mu \le \frac{H+\eta}{\lg m},$$

which proves the second assertion of Theorem 4.


(Translated by R. A. Silverman)

On the Fundamental Theorems of
Information Theory 
(Uspekhi Matematicheskikh Nauk, vol. XI, no. 1, 1956, pp. 17-75) 


INTRODUCTION 


Information theory is one of the youngest branches of 
applied probability theory; it is not yet ten years old. The 
date of its birth can, with certainty, be considered to be the 
appearance in 1947-1948 of the by now classical work of Claude 
Shannon [1]. Rarely does it happen in mathematics that a
new discipline achieves the character of a mature and developed
scientific theory in the first investigation devoted to it. Such
in its time was the case with the theory of integral equations,
after the fundamental work of Fredholm; so it was with
information theory after the work of Shannon.

From the very beginning, information theory presents mathe- 
matics with a whole new set of problems, including some very 
difficult ones. It is quite natural that Shannon and his first 
disciples, whose basic goal was to obtain practical results, were 
not able to pay enough attention to these mathematical diffi- 
culties at the beginning. Consequently, at many points of their 
investigations they were compelled either to be satisfied with 
reasoning of an inconclusive nature or to limit artificially the 
set of objects studied (sources, channels, codes, etc.) in order
to simplify the proofs. Thus, the whole mass of literature of 
the first years of information theory, of necessity, bears the 
imprint of mathematical incompleteness which, in particular, 
makes it extremely difficult for mathematicians to become 
acquainted with this new subject. The recently published 
general textbook on information theory by S. Goldman [2] can 
serve as a typical example of the style prevalent in this
literature.

Investigations with the aim of setting information theory on
a solid mathematical basis have begun to appear only in recent
years and, at the present time, are few in number. First of
all, we must mention the work of McMillan [3] in which the 
fundamental concepts of the theory of discrete sources (source, 
channel, code, etc.) were first given precise mathematical defini- 
tions. The most important result of this work must be con- 
sidered to be the proof of the remarkable theorem that any 
discrete ergodic source has the property which Shannon attri- 
buted to sources of Markov type and which underlies almost 
all the asymptotic calculations of information theory.* This 
circumstance permits the whole theory of discrete information 
to be constructed without being limited, as was Shannon, to 
Markov type sources. In the rest of his paper McMillan tries 
to put Shannon’s fundamental theorem on channels with noise 
on a rigorous basis. In doing so, it becomes apparent that the 
sketchy proof given by Shannon contains gaps which remain 
even in the case of Markov sources. The elimination of these 
gaps is begun in McMillan’s paper, but is not completed. 

* Sections 5-8 of this paper are devoted to this theorem.

Next, it is necessary to mention the work of Feinstein [4].
Like McMillan, Feinstein considers the Shannon theorem on
channels with noise to be the pinnacle of the general theory
of discrete information and he undertakes to give a mathematically
rigorous proof of this theorem. Accepting completely
McMillan’s mathematical apparatus, he avoids following Shannon’s
original path and constructs a proof, using the completely
new and apparently very fruitful idea of a “distinguishable
set of sequences”, the principal features of which will be
explained below. However, Feinstein carries out the proof in
all details only for the simplest and least practical case, where
the successive signals of the source are mutually independent
and the channel memory is zero. In the more general case, he
indicates only sketchily how the reader is to carry out the
necessary reasoning independently. Unfortunately, there remains
a whole series of significant difficulties.

As is well known, Shannon formulated his theorem on channels 
with noise in two different ways. One was in terms of a 
quantity called equivocation, and the other was in terms of the 
probability of error. McMillan’s analysis leads to the conclusion 
that these two formulations are not equivalent, and that the 
second gives a more exact result than the first. Feinstein’s 
more detailed investigation showed that although the first 
formulation is implied by the second, a rigorous derivation of 
this implication is not only non-trivial but fraught with con- 
siderable additional difficulties. Since both formulations are 
equally important in actual content, it is preferable to speak 
about two Shannon theorems rather than combine them under 
the same heading. 

In this paper I attempt to give a complete, detailed proof of 
both of these Shannon theorems, assuming any ergodic source 
and any stationary channel with a finite memory. At the 
present time, apparently, these are the broadest hypotheses 
under which the Shannon theorems can be regarded as valid. 
On the whole, I follow the path indicated in the works of 
McMillan and Feinstein, deviating from them only in the com- 
paratively few cases when I see a gap in their explanation, or 
when another explanation seems to me more complete and con- 
vincing (and sometimes, more simple). 


The first chapter of the paper, which is of purely auxiliary
character, requires special explanation. It is devoted to the
derivation of a whole set of unrelated inequalities, each of 
which is a theorem of elementary probability theory (i.e., per- 
tains only to finite spaces). The reader acquainted with my 
paper [5] will be able to begin this paper with the second 
chapter, returning to the first chapter only when references 
to its results appear in the text. All the following chapters 
are constructed according to a specific plan, and can not be 
skipped or read in different order. 

The reader will see that the path to the Shannon theorems 
is long and thorny, but apparently science, at this time, knows 
no shorter path if we do not want artificial restrictions on the 
material studied and if we are to avoid making statements 
which we can not prove. 


CHAPTER I. 


Elementary Inequalities 


§1. Two generalizations of Shannon’s inequality
Let A be a finite probability space composed of elementary
events $A_i$ with probabilities

$$p(A_i) \quad \Big(1 \le i \le n,\ p(A_i) > 0;\ \sum_{i=1}^{n} p(A_i) = 1\Big).$$

The quantity*

$$H(A) = -\sum_{i=1}^{n} p(A_i) \lg p(A_i)$$

is called the entropy of the space A. The significance of this
quantity as a measure of uncertainty or as the amount of 
information contained in the space A was illustrated by us in 
detail in [5]. Important properties of the entropy are enumer- 
ated there. 

Let us now consider, along with A, another finite space B,
with the elementary events $B_k$ and the distribution $p(B_k)$
($1 \le k \le m$, $p(B_k) > 0$; $\sum_{k=1}^{m} p(B_k) = 1$). The events $A_i$ and $B_k$ of the spaces
A and B can be dependent. The events $A_iB_k$ with the probabilities
$p(A_iB_k)$ can be regarded as elementary events of a new
finite space which we shall designate by AB (or BA), and which
we shall call the product of the spaces A and B. The entropy
of this space is

$$H(AB) = -\sum_i \sum_k p(A_iB_k) \lg p(A_iB_k).$$

* In this paper all logarithms are to the base 2.

If it is known that the event $A_i$ occurred, then the events $B_k$
of the space B have the new probabilities

$$p_{A_i}(B_k) = \frac{p(A_iB_k)}{p(A_i)} \quad (k = 1, 2, \cdots, m)$$

instead of the previous $p(B_k)$. Correspondingly, the previous
entropy of the space B

$$H(B) = -\sum_k p(B_k) \lg p(B_k)$$

is replaced by the new quantity

$$H_{A_i}(B) = -\sum_k p_{A_i}(B_k) \lg p_{A_i}(B_k),$$

which, naturally, we shall regard as the conditional entropy of
the space B under the assumption that the event $A_i$ occurred
in the space A. A specific value of $H_{A_i}(B)$ corresponds to each
of the events $A_i$ of the space A, so that $H_{A_i}(B)$ can be regarded
as a random variable defined on the space A. The mathematical
expectation of this random variable

$$H_A(B) = \sum_i p(A_i)\, H_{A_i}(B)$$


is the conditional entropy of the space B averaged over the 
space A. It indicates how much information is contained on 
the average in the space B, if it is known which of the events 
of the space A actually occurred. In my paper [5], it is shown 
that 


$$H(AB) = H(A) + H_A(B), \tag{1.1}$$

a relation which is very natural from the standpoint of the real
meaning of the quantities in it; in the special case where the
spaces A and B are (mutually) independent, we have $H_A(B) = H(B)$,
and therefore

$$H(AB) = H(A) + H(B).$$


Shannon’s fundamental inequality (also introduced in my paper)
is especially important for the purposes of this section. It states
that for any finite spaces A and B

$$H_A(B) \le H(B), \tag{1.2}$$

the real meaning of which is that, on the average, the amount
of uncertainty in the space B can either decrease or remain
the same, if it is known which event occurred in some other
space A. (The uncertainty of a situation can not be increased
as a result of obtaining any additional information.) It follows
from (1.1) and (1.2) that

$$H(AB) \le H(A) + H(B), \tag{1.3}$$

an inequality which is easily generalized to the case of the
product of any number of spaces; thus

$$H(ABC) \le H(A) + H(B) + H(C),$$

etc. The inequality (1.2) can be generalized in various directions;
we now prove two such generalizations, which will be
needed subsequently.
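
The relations (1.1), (1.2) and (1.3) are easy to check numerically. The following sketch uses a hypothetical joint distribution `joint[i][k]` standing for $p(A_iB_k)$; the numbers are made up for illustration only.

```python
from math import log2

def H(probs):
    """Entropy of a finite space, logarithms to base 2."""
    return -sum(q * log2(q) for q in probs if q > 0)

def conditional_entropy(joint):
    """H_A(B) = sum_i p(A_i) * H_{A_i}(B), with joint[i][k] = p(A_i B_k)."""
    total = 0.0
    for row in joint:
        p_a = sum(row)
        if p_a > 0:
            total += p_a * H([q / p_a for q in row])
    return total

# Hypothetical joint distribution of a 3-event space A and a 2-event space B.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]
p_A = [sum(row) for row in joint]
p_B = [sum(col) for col in zip(*joint)]
H_AB = H([q for row in joint for q in row])
H_A_of_B = conditional_entropy(joint)
```

With these values `H_AB` agrees with `H(p_A) + H_A_of_B` up to rounding, while `H_A_of_B` does not exceed `H(p_B)`.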


Let A and B be any two finite spaces. We keep all the notation
introduced above. In expanded form, (1.2) becomes

$$\sum_i p(A_i) \sum_k p_{A_i}(B_k) \lg p_{A_i}(B_k) \ge \sum_k p(B_k) \lg p(B_k).$$

For later use, it is important to show that this inequality remains
valid when we sum both sides not over all, but only over
certain values of the subscript k (but, of course, over the same
values in both sides of the inequality). In other words, this
inequality does not depend on whether or not the events $B_k$
form a “complete system”.

Lemma 1.1.

$$\sum_i p(A_i) \sum_k^{*} p_{A_i}(B_k) \lg p_{A_i}(B_k) \ge \sum_k^{*} p(B_k) \lg p(B_k), \tag{1.4}$$

where $\sum_k^{*}$ denotes summation over certain values of the subscript
k (not necessarily all, but the same values in both sides of the
inequality).

Remark.

In the special case where the summation is carried out over
all k ($1 \le k \le m$), the left and right sides of (1.4) become $-H_A(B)$
and $-H(B)$, respectively, so that (1.4) agrees with (1.2). Therefore,
(1.2) is a special case of Lemma 1.1, and hence is proved
once the lemma is.

Proof.

The function $f(x) = x \lg x$ is convex for x > 0; consequently,
for $x_i > 0$, $\lambda_i \ge 0$, $\sum_i \lambda_i = 1$, the following inequality holds (see
[5], [6])

$$\sum_i \lambda_i f(x_i) \ge f\Big(\sum_i \lambda_i x_i\Big). \tag{1.5}$$

Putting $\lambda_i = p(A_i)$, $x_i = p_{A_i}(B_k)$, we find

$$\sum_i p(A_i) \sum_k^{*} p_{A_i}(B_k) \lg p_{A_i}(B_k) = \sum_k^{*} \sum_i p(A_i)\, f[p_{A_i}(B_k)]$$

$$\ge \sum_k^{*} f\Big\{\sum_i p(A_i)\, p_{A_i}(B_k)\Big\} = \sum_k^{*} f[p(B_k)] = \sum_k^{*} p(B_k) \lg p(B_k), \quad \text{Q.E.D.}$$
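
A quick numerical check of Lemma 1.1 may help to fix the notation. The sketch below evaluates both sides of (1.4) for every non-empty set of subscripts k; the 3-by-3 joint distribution is hypothetical.

```python
from itertools import combinations
from math import log2

def f(x):
    """f(x) = x lg x, with f(0) taken as 0."""
    return x * log2(x) if x > 0 else 0.0

def lemma_1_1_sides(joint, ks):
    """Left and right sides of (1.4), summing only over the subscripts in ks."""
    p_A = [sum(row) for row in joint]
    p_B = [sum(col) for col in zip(*joint)]
    left = sum(p_A[i] * f(joint[i][k] / p_A[i])
               for i in range(len(joint)) for k in ks)
    right = sum(f(p_B[k]) for k in ks)
    return left, right

# Hypothetical joint distribution p(A_i B_k).
joint = [[0.20, 0.05, 0.05],
         [0.05, 0.25, 0.10],
         [0.10, 0.05, 0.15]]
subsets = [ks for r in (1, 2, 3) for ks in combinations(range(3), r)]
sides = [lemma_1_1_sides(joint, ks) for ks in subsets]
```

Each pair in `sides` satisfies left ≥ right, including the pairs for which the chosen events $B_k$ do not form a complete system.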


Now we generalize the inequality (1.2) in another direction.

Lemma 1.2.

For any three finite spaces A, B, C

$$H_{AB}(C) \le H_B(C).$$

Remark.

In the special case where the space B consists of one event
with probability 1, the space AB coincides with A, and $H_B(C) = H(C)$.
For this special case, the statement of Lemma 1.2
becomes

$$H_A(C) \le H(C),$$

and is equivalent to the inequality (1.2), which is therefore a
special case of Lemma 1.2.

Proof.

Let a given event $B_k$ occur in the space B; then all the
events $A_iC_l$ in the product space AC have the probabilities
$p_{B_k}(A_iC_l) = q(A_iC_l)$; in just the same way the space A becomes
the space $A'$ with the probabilities $p_{B_k}(A_i) = q(A_i)$ and the space
C becomes $C'$ with the probabilities $p_{B_k}(C_l) = q(C_l)$. Hence, according
to inequality (1.2)

$$H_{A'}(C') \le H(C'). \tag{1.6}$$

But

$$H(C') = -\sum_l q(C_l) \lg q(C_l) = -\sum_l p_{B_k}(C_l) \lg p_{B_k}(C_l) = H_{B_k}(C). \tag{1.7}$$

On the other hand

$$H_{A'}(C') = -\sum_i q(A_i) \sum_l q_{A_i}(C_l) \lg q_{A_i}(C_l),$$

where

$$q_{A_i}(C_l) = \frac{q(A_iC_l)}{q(A_i)} = \frac{p_{B_k}(A_iC_l)}{p_{B_k}(A_i)} = p_{B_kA_i}(C_l);$$

consequently

$$H_{A'}(C') = -\sum_i p_{B_k}(A_i) \sum_l p_{B_kA_i}(C_l) \lg p_{B_kA_i}(C_l) = \sum_i p_{B_k}(A_i)\, H_{B_kA_i}(C). \tag{1.8}$$

Substituting (1.7) and (1.8) into (1.6), we find

$$\sum_i p_{B_k}(A_i)\, H_{A_iB_k}(C) \le H_{B_k}(C).$$

Multiplying both sides of this inequality by $p(B_k)$ and then
summing over all k, we obtain

$$\sum_k \sum_i p(B_k)\, p_{B_k}(A_i)\, H_{A_iB_k}(C) \le \sum_k p(B_k)\, H_{B_k}(C),$$

or

$$\sum_i \sum_k p(A_iB_k)\, H_{A_iB_k}(C) \le H_B(C),$$

or, finally

$$H_{AB}(C) \le H_B(C), \quad \text{Q.E.D.}$$
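
Lemma 1.2 can likewise be checked on a randomly generated example. In the sketch below the three-way joint distribution `joint[(i, k, l)]`, standing for $p(A_iB_kC_l)$, is produced by a seeded random number generator and is purely hypothetical; the helper name `conditional_entropy_of_C` is an illustrative choice.

```python
import random
from math import log2

def conditional_entropy_of_C(joint, cond):
    """Entropy of C averaged over the conditioning events cond(i, k)."""
    groups = {}
    for (i, k, l), q in joint.items():
        dist = groups.setdefault(cond(i, k), {})
        dist[l] = dist.get(l, 0.0) + q
    total = 0.0
    for dist in groups.values():
        weight = sum(dist.values())          # probability of the condition
        total -= sum(q * log2(q / weight) for q in dist.values() if q > 0)
    return total

rng = random.Random(0)
weights = {(i, k, l): rng.random()
           for i in range(3) for k in range(3) for l in range(3)}
norm = sum(weights.values())
joint = {key: w / norm for key, w in weights.items()}

H_AB_C = conditional_entropy_of_C(joint, lambda i, k: (i, k))   # H_AB(C)
H_B_C = conditional_entropy_of_C(joint, lambda i, k: k)         # H_B(C)
H_C = conditional_entropy_of_C(joint, lambda i, k: None)        # H(C)
```

The three values come out ordered as the lemma and (1.2) require: conditioning on more can only lower the average uncertainty of C.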


§2. Three Inequalities of Feinstein [4]


We again consider two finite spaces A and B and their
product AB. Let Z be some set of events $A_iB_k$ of the space
AB, and let $U_0$ be some set of events $A_i$ of the space A; let
$\delta_1 > 0$, $\delta_2 > 0$, and

$$p(Z) > 1 - \delta_1; \qquad p(U_0) > 1 - \delta_2.$$

Let us denote by $\Gamma_i$ ($1 \le i \le n$) the set of events $B_k$ of the space
B for which $A_iB_k$ does not belong to the set Z. Finally, let $U_1$
be the set of events $A_i \in U_0$ for which $p_{A_i}(\Gamma_i) \le \alpha$. Then we
have

Lemma 2.1.

$$p(U_1) > 1 - \delta_2 - \frac{\delta_1}{\alpha}.$$


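Proof.
Let $U_2$ be the set of events $A_i$ for which $p_{A_i}(\Gamma_i) > \alpha$, so that

$$U_1 = U_0 - U_0U_2. \tag{2.1}$$

If $A_i \in U_2$, then

$$p(A_i\Gamma_i) = p(A_i)\, p_{A_i}(\Gamma_i) > \alpha\, p(A_i),$$

which means that

$$p\Big(\sum_{A_i \in U_2} A_i\Gamma_i\Big) = \sum_{A_i \in U_2} p(A_i\Gamma_i) > \alpha \sum_{A_i \in U_2} p(A_i) = \alpha\, p(U_2).$$

On the other hand, since all the events $A_i\Gamma_i$ are incompatible
with the event Z, the probability of which exceeds $1 - \delta_1$, then

$$p\Big(\sum_{A_i \in U_2} A_i\Gamma_i\Big) \le 1 - p(Z) < \delta_1,$$

and we find that $\alpha\, p(U_2) < \delta_1$, or $p(U_2) < \delta_1/\alpha$. It follows, a fortiori,
that

$$p(U_0U_2) < \frac{\delta_1}{\alpha};$$

hence, by (2.1)

$$p(U_1) = p(U_0) - p(U_0U_2) > 1 - \delta_2 - \delta_1/\alpha, \quad \text{Q.E.D.}$$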
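
Lemma 2.1 can be tested on random examples. Letting $\delta_1$ and $\delta_2$ decrease to $1 - p(Z)$ and $1 - p(U_0)$ in the lemma gives the limiting form $p(U_1) \ge p(U_0) - (1 - p(Z))/\alpha$, and it is this form that the sketch below evaluates. The joint distributions, the sets Z and $U_0$, and the values of $\alpha$ are all generated at random and are purely hypothetical.

```python
import random

def lemma_2_1_sides(joint, Z, U0, alpha):
    """Return p(U_1) and the bound p(U_0) - (1 - p(Z)) / alpha."""
    p_A = [sum(row) for row in joint]
    p_Z = sum(joint[i][k] for (i, k) in Z)
    p_U0 = sum(p_A[i] for i in U0)
    p_U1 = 0.0
    for i in U0:
        # Gamma_i: events B_k for which A_i B_k is not in Z
        gamma = sum(q for k, q in enumerate(joint[i]) if (i, k) not in Z)
        if gamma / p_A[i] <= alpha:
            p_U1 += p_A[i]
    return p_U1, p_U0 - (1.0 - p_Z) / alpha

def random_instance(rng, n=5, m=5):
    w = [[rng.random() for _ in range(m)] for _ in range(n)]
    total = sum(map(sum, w))
    joint = [[x / total for x in row] for row in w]
    Z = {(i, k) for i in range(n) for k in range(m) if rng.random() < 0.8}
    U0 = [i for i in range(n) if rng.random() < 0.8]
    return joint, Z, U0, rng.uniform(0.1, 1.0)

rng = random.Random(1)
results = [lemma_2_1_sides(*random_instance(rng)) for _ in range(200)]
```

In every generated instance the first returned number is at least the second, as the proof above guarantees.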


For a given subscript k ($1 \le k \le m$), we now designate by $i_k$
the value of the subscript i ($1 \le i \le n$), for which the probability
$p(A_iB_k)$ assumes its greatest value (in the space AB);
if there is more than one such i value, we take any one as $i_k$.
Thus $A_{i_k}$ is the event of the space A which is most probable
for a given event $B_k$ of the space B. Obviously, the sum

$$P = \sum_{k=1}^{m} \sum_{i \ne i_k} p(A_iB_k)$$

is the probability (in the space AB) of the appearance of a pair
of events $A_iB_k$ such that $A_i$ is not the event of the space A
which is most probable for a given event $B_k$ of the space B.
Clearly, it is possible to write

$$P = p(i \ne i_k).$$

Lemma 2.2.

If for a given $\varepsilon$ ($0 < \varepsilon < 1$), a set $\Delta_i$ of events $B_k$ can be
associated with each $A_i$ ($1 \le i \le n$) such that

1) $p(\Delta_i\Delta_j) = 0$ ($i \ne j$),

2) $p_{A_i}(\Delta_i) > 1 - \varepsilon$ ($1 \le i \le n$),

then $P \le \varepsilon$.

Proof.

Obviously, we have

$$P = 1 - \sum_{k=1}^{m} p(A_{i_k}B_k); \qquad 1 - P = \sum_{k=1}^{m} p(A_{i_k}B_k). \tag{2.2}$$

Let us denote by $\Delta_0$ the set of events $B_k$ (if such exist) which
are not in any of the sets $\Delta_i$ ($1 \le i \le n$); then, clearly, the
range of summation in the last sum can be expanded into the
parts $\Delta_0, \Delta_1, \cdots, \Delta_n$, and therefore

$$1 - P = \sum_{i=1}^{n} \sum_{B_k \in \Delta_i} p(A_{i_k}B_k) + \sum_{B_k \in \Delta_0} p(A_{i_k}B_k) \ge \sum_{i=1}^{n} \sum_{B_k \in \Delta_i} p(A_{i_k}B_k).$$

According to the definition of the subscript $i_k$, the right side
of this inequality can only be decreased if we replace the subscript
$i_k$ in each term of the sum by any other subscript from
1 to n. Thus, in particular

$$1 - P \ge \sum_{i=1}^{n} \sum_{B_k \in \Delta_i} p(A_iB_k) = \sum_{i=1}^{n} p(A_i\Delta_i) = \sum_{i=1}^{n} p(A_i)\, p_{A_i}(\Delta_i)$$

$$> (1 - \varepsilon) \sum_{i=1}^{n} p(A_i) = 1 - \varepsilon, \quad \text{Q.E.D.}$$

As before, let n denote the number of elementary events $A_i$ of
the space A.
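Lemma 2.3.

For n > 1

$$H_B(A) \le P \lg (n-1) - P \lg P - (1-P) \lg (1-P).$$

Proof.
As above, for brevity, we write f(x) for $x \lg x$. We have

$$H_B(A) = -\sum_k p(B_k) \sum_i f[p_{B_k}(A_i)] = H_1 + H_2,$$

where

$$H_1 = -\sum_k p(B_k)\, f[p_{B_k}(A_{i_k})],$$

$$H_2 = -\sum_k p(B_k) \sum_{i \ne i_k} f[p_{B_k}(A_i)].$$

Putting $\lambda_k = p(B_k)$, $x_k = p_{B_k}(A_{i_k})$ in the inequality (1.5), we find
by (2.2) that

$$H_1 = -\sum_k \lambda_k f(x_k) \le -f\Big(\sum_k \lambda_k x_k\Big) = -f\Big(\sum_k p(B_k)\, p_{B_k}(A_{i_k})\Big)$$

$$= -f\Big(\sum_k p(B_kA_{i_k})\Big) = -f(1-P) = -(1-P) \lg (1-P). \tag{2.3}$$

A similar application of the inequality (1.5) yields because of
(2.2)

$$-\sum_k p(B_k)\, f[1 - p_{B_k}(A_{i_k})] \le -f\Big(\sum_k p(B_k)\,[1 - p_{B_k}(A_{i_k})]\Big)$$

$$= -f\Big(1 - \sum_k p(B_kA_{i_k})\Big) = -f(P) = -P \lg P. \tag{2.4}$$

Furthermore, since

$$1 - p_{B_k}(A_{i_k}) = \sum_{i \ne i_k} p_{B_k}(A_i),$$

then for any fixed k ($1 \le k \le m$)

$$f[1 - p_{B_k}(A_{i_k})] = \sum_{i \ne i_k} p_{B_k}(A_i) \lg \sum_{i \ne i_k} p_{B_k}(A_i)$$

$$= (n-1)\, f\left[\frac{\sum_{i \ne i_k} p_{B_k}(A_i)}{n-1}\right] + \lg (n-1) \sum_{i \ne i_k} p_{B_k}(A_i).$$

We again use (1.5) to estimate the first term on the right
side by putting this time $\lambda_i = \frac{1}{n-1}$, $x_i = p_{B_k}(A_i)$ ($1 \le i \le n$, $i \ne i_k$).
This gives

$$f\left[\frac{\sum_{i \ne i_k} p_{B_k}(A_i)}{n-1}\right] \le \frac{1}{n-1} \sum_{i \ne i_k} f[p_{B_k}(A_i)],$$

and, therefore

$$f[1 - p_{B_k}(A_{i_k})] \le \sum_{i \ne i_k} f[p_{B_k}(A_i)] + \lg (n-1) \sum_{i \ne i_k} p_{B_k}(A_i).$$

Multiplying all the terms of this inequality by $p(B_k)$ and summing
over k from 1 to m, we find by (2.2) that

$$\sum_k p(B_k)\, f[1 - p_{B_k}(A_{i_k})] \le \sum_k p(B_k) \sum_{i \ne i_k} f[p_{B_k}(A_i)] + P \lg (n-1) = -H_2 + P \lg (n-1). \tag{2.5}$$

Finally, combining (2.3), (2.4), and (2.5) term by term, we find

$$H_1 \le -(1-P) \lg (1-P) - P \lg P - H_2 + P \lg (n-1),$$

from which

$$H_B(A) = H_1 + H_2 \le P \lg (n-1) - P \lg P - (1-P) \lg (1-P), \quad \text{Q.E.D.}$$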
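
The quantities entering Lemma 2.3 are simple to compute for a concrete example. The sketch below takes a hypothetical joint distribution `joint[i][k]` for $p(A_iB_k)$, finds P from (2.2) by picking the most probable $A_i$ for each $B_k$, and compares $H_B(A)$ with the right side of the lemma.

```python
from math import log2

def f(x):
    """f(x) = x lg x, with f(0) taken as 0."""
    return x * log2(x) if x > 0 else 0.0

def error_probability(joint):
    """P = 1 - sum_k p(A_{i_k} B_k), where i_k maximizes p(A_i B_k) over i."""
    return 1.0 - sum(max(col) for col in zip(*joint))

def H_B_of_A(joint):
    """Conditional entropy of A averaged over B."""
    total = 0.0
    for col in zip(*joint):
        p_b = sum(col)
        if p_b > 0:
            total -= p_b * sum(f(q / p_b) for q in col)
    return total

def lemma_2_3_bound(P, n):
    return P * log2(n - 1) - f(P) - f(1 - P)

# Hypothetical joint distribution with n = 3 events A_i.
joint = [[0.20, 0.05, 0.05],
         [0.05, 0.25, 0.10],
         [0.10, 0.05, 0.15]]
P = error_probability(joint)
lhs = H_B_of_A(joint)
rhs = lemma_2_3_bound(P, len(joint))
```

For this example `lhs` stays below `rhs`, in agreement with the lemma.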


CHAPTER II.


Ergodic Sources


§3. Concept of source. Stationarity. Entropy


In statistical communication theory, the output of every 
information source is regarded as a random process. The statis- 
tical structure of this process constitutes the mathematical 
definition of the given source. In this paper, we shall deal 
exclusively with discrete sources (processes with discrete time); 
the output of such a source is a sequence of random quantities 
(or events). Consequently, we must understand the definition 
of a source to be the complete probabilistic characterization of 
such a sequence. 

Underlying the definition of every source is the set A of
symbols used by it, which we call its alphabet, and which we
always assume to be finite. The separate symbols of this alphabet
are called its letters. Let us consider a sequence of letters,
infinite on both sides,

$$x = (\cdots, x_{-1}, x_0, x_1, x_2, \cdots), \tag{3.1}$$

which represents a possible “life history” of the given source.
We shall regard it as an elementary event in a certain (infinite)
probability space, a space the specification of which characterizes
the sequence (3.1) as a random process. The set of all
sequences (3.1) (i.e., the set of all elementary events of the
given space) will be denoted by $A'$. Any subset of the set $A'$
represents an event of our space, and conversely. Thus, for
example, the event “the source emits the letter $\alpha$ at time t
and the letter $\beta$ at time u” is the set of all sequences x of
the form (3.1) for which

$$x_t = \alpha; \qquad x_u = \beta.$$

Generally, if $t_1, t_2, \cdots, t_n$ are any integers and $\alpha_1, \alpha_2, \cdots, \alpha_n$ are
any letters of the alphabet A, then the event “the source
emits the letter $\alpha_i$ at time $t_i$ ($1 \le i \le n$)” is the set of all x
for which

$$x_{t_i} = \alpha_i \quad (1 \le i \le n).$$

Subsequently, we shall call such a set of elementary events
x a cylinder set, or briefly a cylinder. As is well known, it is
sufficient to know the probability $\mu(Z)$ of all cylinders Z to
define the sequence (3.1) as a random process. Let us consider
the set of all cylinders of the given alphabet A and its Borel
extension $F_A$, i.e., the intersection of all the Borel fields which
contain all the cylinders of the alphabet A. Then, giving the
probabilities $\mu(Z)$ of all cylinders Z uniquely determines the
probability $\mu(S)$ of any set $S \in F_A$ of elementary events x. Thus
a complete description of the source as a random process is
achieved by specifying 1) an alphabet A, and 2) a probability
measure $\mu(S)$ defined for all $S \in F_A$. In particular, we always
have $\mu(A') = 1$. Since the alphabet A and the probability measure
$\mu$ completely characterize the statistical nature of the source,
we can denote the source by the symbol $[A, \mu]$.

In his basic work [1], Shannon considered only sources with
the character of stationary Markov chains; the characterization
of such sources is achieved by more elementary means. The general
concept of source just given is due to McMillan [3].

Consider any given sequence x of the type (3.1) of letters
from the alphabet A, and denote by Tx the sequence

$$Tx = (\cdots, x'_{-1}, x'_0, x'_1, x'_2, \cdots),$$

where $x'_k = x_{k+1}$ ($-\infty < k < +\infty$) (so that the operator T denotes
the “shift” by one time unit). If S is any set of elements x,
then the set TS is the set of all Tx for which $x \in S$. (In other
words, the relations $x \in S$ and $Tx \in TS$ are equivalent.) It is easy
to see that if $S \in F_A$, we have $TS \in F_A$; it is also obvious that
the operator T maps the set $A'$ into itself

$$TA' = A'.$$

If

$$\mu(TS) = \mu(S)$$

for any set $S \in F_A$, then the source is called stationary. Evidently,
by the stationarity of a source is meant the time invariance of
the probability regime of its output. All the sources considered
below will be assumed to be stationary.

From the information theory viewpoint, the most important
characteristic of every source is the rate at which it emits
information, i.e., the average amount of information given by
one emitted symbol. We now show how to arrive at an exact
definition of this quantity. Consider a sequence of n successive
symbols emitted by a given source; let this sequence be
$x_t, x_{t+1}, \cdots, x_{t+n-1}$. If the source alphabet A contains a letters, then
the number of such different n-term sequences is obviously $a^n$.
Every such sequence C is a cylinder in the space $A'$ (i.e., the
set of all $x \in A'$ for which the $x_t, \cdots, x_{t+n-1}$ assume the fixed
values characterizing the given sequence), and therefore has a
definite probability $\mu(C)$. Thus, the set of all possible n-term
sequences of the type described represents a finite probability
space consisting of $a^n$ elementary events C with probabilities
$\mu(C)$. In chapter I we agreed to measure the amount of information
contained in such a space by its entropy

$$H_n = -\sum_C \mu(C) \lg \mu(C).$$

If the given source is stationary (as we shall always assume),
then the probabilities $\mu(C)$, and therefore the entropy $H_n$ of an
n-term sequence do not depend on the “initial moment” t, and
are uniquely determined by the nature of the source and by
the number n. Thus we can say that the sequence of n symbols
emitted by the source gives a well defined amount of information
$H_n$, which depends only on n and on the nature of the
source, so that, on the average, the amount of information per
symbol emitted by the source is $H_n/n$. Consequently, it is
natural to agree to call the quantity

$$H = \lim_{n \to \infty} \frac{H_n}{n}$$

the source entropy, i.e., the average amount of information
conveyed by one emitted symbol (if, of course, the limit exists).
Clearly, the entropy H as thus defined, depends only on the
nature of the source (i.e., on the alphabet A and on the probability
distribution $\mu$). The fact that the entropy really exists
for every stationary source is the first fundamental theorem
of the general theory of discrete sources. We turn to its proof.*

* As far as I know, the first proof was given by McMillan [3]. However, McMillan had
in mind a broader aim, so that his proof is considerably more formidable than the one given
below.

The space $A_{n+m}$ of sequences of length n+m (where n and
m are positive integers) can be regarded as the product of the
space $A_n$ of sequences of length n and the space $A_m$ of sequences
of length m (see §1). Hence, according to §1

$$H(A_{n+m}) = H(A_n) + H_{A_n}(A_m); \qquad H_{A_n}(A_m) \le H(A_m),$$

from which

$$H(A_n) \le H(A_{n+m}) \le H(A_n) + H(A_m).$$

In our new notation, this can be written as

$$H_n \le H_{n+m} \le H_n + H_m.$$

In particular, the first of these inequalities yields (for m=1)

$$H_{n+1} \ge H_n, \tag{3.2}$$

and the second is easily extended to any number of terms and,
in particular, gives for any integral k

$$H_{kn} \le kH_n. \tag{3.3}$$

Setting n=1 in (3.3), we find that for any $k \ge 1$

$$H_k \le kH_1,$$

which shows that

$$a = \liminf_{n \to \infty} \frac{H_n}{n} \le H_1 < +\infty.$$

Now let $\varepsilon > 0$ be given arbitrarily, and let the subscript q be
chosen such that

$$\frac{H_q}{q} < a + \varepsilon.$$

For any n > q, we determine an integer k > 1 such that

$$(k-1)q < n \le kq.$$

Then, because of (3.2)

$$H_n \le H_{kq}.$$

Consequently, by (3.3)

$$\frac{H_n}{n} \le \frac{H_{kq}}{(k-1)q} \le \frac{k}{k-1}\,\frac{H_q}{q} < \frac{k}{k-1}\,(a+\varepsilon),$$

and therefore, for sufficiently large n (and hence for appropriately
large k)

$$a - \varepsilon < \frac{H_n}{n} < \frac{k}{k-1}\,(a+\varepsilon) < a + 2\varepsilon.$$

But since $\varepsilon$ is arbitrarily small

$$\lim_{n \to \infty} \frac{H_n}{n} = a, \quad \text{Q.E.D.}$$
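
The behaviour of $H_n$ used in this proof can be observed on a concrete stationary source. The sketch below assumes a hypothetical two-letter stationary Markov chain (transition matrix `P`, stationary distribution `pi`) and computes $H_n$ by brute-force enumeration of all n-term sequences; for such a chain the limit of $H_n/n$ is known to equal the average entropy of the rows of `P`, which is what `rate` holds.

```python
from itertools import product
from math import log2

# Hypothetical stationary Markov source on the alphabet {0, 1}.
P = [[0.9, 0.1],
     [0.4, 0.6]]
pi = [0.8, 0.2]          # satisfies pi = pi P for the matrix above

def block_entropy(n):
    """H_n = -sum over n-term sequences C of mu(C) lg mu(C)."""
    total = 0.0
    for seq in product(range(2), repeat=n):
        mu = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            mu *= P[a][b]
        total -= mu * log2(mu)
    return total

H = {n: block_entropy(n) for n in range(1, 11)}
rate = sum(pi[i] * -sum(q * log2(q) for q in P[i]) for i in range(2))
ratios = [H[n] / n for n in range(1, 11)]
```

The values in `H` are non-decreasing and subadditive, as (3.2) and (3.3) state, and the entries of `ratios` approach `rate` as n grows.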





§4. Ergodic Sources
The set S of elements $x \in A'$ is called invariant if TS=S,
i.e., if the “shift operator” T carries the set into itself. The
set $A'$ is always invariant. For any $x \in A'$, the set of elements
$\cdots, T^{-1}x, x, Tx, T^2x, \cdots$ is always an invariant set. The source
$[A, \mu]$ is called ergodic if the probability $\mu(S)$ of every invariant
set $S \in F_A$ is either 0 or 1. The ergodic property is very important
in the study of the statistical structure of the source.
This results from the following considerations. Each (numerically
valued) function f(x) of the elementary event $x \in A'$ can
be regarded as a random variable defined on the space $A'$, and
conversely. If the abstract Lebesgue integral satisfies

$$\int_{A'} |f(x)|\, d\mu(x) < \infty, \tag{4.1}$$

then this random variable has the mathematical expectation

$$Mf(x) = \int_{A'} f(x)\, d\mu(x).$$

The well-known “ergodic theorem” of Birkhoff states that
for any stationary source $[A, \mu]$ and for every summable function
f(x) (i.e., satisfying the requirement (4.1)) the limit

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^kx) = h(x)$$

exists almost everywhere (i.e., with probability 1) where the
function h(x) is invariant, i.e., h(Tx)=h(x) for all x for which
h(x) exists. If the source $[A, \mu]$ is ergodic, then almost everywhere
(with probability 1)

$$h(x) = Mf(x);$$

thus, in the case of an ergodic source, the ratio

$$\frac{1}{n} \sum_{k=0}^{n-1} f(T^kx)$$

approaches the mathematical expectation of f(x) as $n \to \infty$, for
almost all x.

Now let $g_S(x)$ be the characteristic function of the set $S \in F_A$
(i.e., $g_S(x) = 1$ ($x \in S$) and $g_S(x) = 0$ ($x \notin S$)). Obviously, $g_S(x)$ is
summable and $Mg_S(x) = \mu(S)$. The sum $\sum_{k=0}^{n-1} g_S(T^kx)$ is the number
of terms of the series $x, Tx, \cdots, T^{n-1}x$ which belong to the set
S. Let us denote this number by $\varphi_n(x)$. Then the Birkhoff
theorem states that in the case of an ergodic source, we have
for almost all x

$$\lim_{n \to \infty} \frac{\varphi_n(x)}{n} = \mu(S).$$

Thus, in this case, the proportion of the terms in the sequence

$$x, Tx, T^2x, \cdots$$

which are elements of the set S is just the probability of the set
S, for almost all such sequences.
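
This statement about frequencies can be illustrated by simulation. The sketch below assumes a hypothetical two-letter stationary Markov chain, which is an ergodic source, and takes for S the cylinder “the source emits 0 at time 0 and 1 at time 1”, whose probability is `pi[0] * P[0][1]`. It counts how often the shifted sequences $T^kx$ fall into S along one simulated realization.

```python
import random

# Hypothetical ergodic Markov source on the alphabet {0, 1}.
P = [[0.9, 0.1],
     [0.4, 0.6]]
pi = [0.8, 0.2]          # stationary distribution of P

def cylinder_frequency(n, seed=0):
    """Fraction of k in 0..n-1 with (x_k, x_{k+1}) = (0, 1) on one sample path."""
    rng = random.Random(seed)
    state = 0 if rng.random() < pi[0] else 1
    hits = 0
    for _ in range(n):
        nxt = 0 if rng.random() < P[state][0] else 1
        if state == 0 and nxt == 1:      # T^k x belongs to the cylinder S
            hits += 1
        state = nxt
    return hits / n

mu_S = pi[0] * P[0][1]
freq = cylinder_frequency(100_000)
```

For a long run the observed frequency `freq` settles close to `mu_S`; a single finite simulation is of course only suggestive and proves nothing about almost all x.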
Let us agree to say that the source $[A, \mu]$ reflects the set
$S \in F_A$ if almost everywhere (with probability 1)

    $$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g_S(T^k x) = \mu(S).$$

We have just seen that an ergodic source reflects any set $S \in F_A$.
It is easy to convince oneself that the converse holds: If the
stationary source $[A, \mu]$ reflects any set $S \in F_A$, then the source
is ergodic. Indeed, if the source $[A, \mu]$ were not ergodic, then
there would exist an invariant set $S \in F_A$ for which $0 < \mu(S) < 1$.
Because of the invariance of $S$, we would have $Tx \in S$, $T^2 x \in S$,
... for any $x \in S$ and, therefore, $g_S(T^k x) = 1$ ($k = 0, 1, 2, \ldots$).
Consequently

    $$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g_S(T^k x) = 1 \neq \mu(S)$$

for any $x \in S$. Since $\mu(S) > 0$, the set $S$ is not reflected
by the source $[A, \mu]$, Q.E.D.

A somewhat stronger theorem will be needed below: In
order for a given stationary source to be ergodic, it is sufficient
for it to reflect all cylinders of the space $A^I$. To see this, it
is enough to show that the sets $S \in F_A$ which are reflected by
a given source form a Borel field, for, if this field contains
all cylinders of the space $A^I$, it contains the whole field $F_A$, by
the very definition of the latter. But this means that the
given source reflects any set $S \in F_A$ and, as we have just seen,
this is sufficient for the source to be ergodic.

Let us denote by $G$ the class of all sets $S \in F_A$ which are
reflected by the given source; in our special case, $G$ contains
all cylinders of the space $A^I$. If the sets $S_1, S_2, \ldots, S_n$ are
non-overlapping, then obviously

    $$g_{S_1 + \cdots + S_n}(x) = g_{S_1}(x) + \cdots + g_{S_n}(x),$$

whence it follows at once that $S_1 \in G, \ldots, S_n \in G$ implies
$S_1 + \cdots + S_n \in G$. Furthermore, if $S_2 \subset S_1$, then

    $$g_{S_1 - S_2}(x) = g_{S_1}(x) - g_{S_2}(x),$$

whence it is easy to see that $S_1 - S_2 \in G$ follows from $S_1 \in G$ and
$S_2 \in G$. In order to convince ourselves that $G$ is a Borel field,
it remains only to show that the rule which we established
above for a finite sum of sets $S_1 + \cdots + S_n$ remains valid for an
infinite sum $\sum_{i=1}^{\infty} S_i$ of non-overlapping sets. For this purpose
we need the following auxiliary proposition.
Lemma 4.1. 

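For any $\varepsilon > 0$, no matter how small, there exists a $\delta > 0$ such
that if the set $U \in F_A$ has probability less than $\delta$, then
$\mu(S_\varepsilon) < \varepsilon$, where $S_\varepsilon$ is the set of values of $x$ for which
$h_U(x) > \varepsilon$. (Here $h_U(x)$ denotes the limit furnished by the Birkhoff
theorem for the function $g_U(x)$.)


Proof.
For $x \in S_\varepsilon$

    $$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g_U(T^k x) = h_U(x) > \varepsilon.$$

Hence, if $n$ is sufficiently large, we have

(4.2)    $$\sum_{k=0}^{n-1} g_U(T^k x) > \frac{\varepsilon}{2} n$$

for all points $x$ of a set $S_\varepsilon'$ with probability

    $$\mu(S_\varepsilon') > \frac{1}{2} \mu(S_\varepsilon).$$

From (4.2) we find by integrating over $x$

(4.3)    $$\sum_{k=0}^{n-1} \int_{A^I} g_U(T^k x) \, d\mu(x) > \frac{\varepsilon}{2} n \mu(S_\varepsilon') > \frac{\varepsilon}{4} n \mu(S_\varepsilon).$$

But, on the other hand, because of the stationarity of the given
source

    $$\int_{A^I} g_U(T^k x) \, d\mu(x) = \int_{A^I} g_U(T^k x) \, d\mu(T^k x) = \mu(U) \quad (0 \le k < n),$$

and (4.3) gives

    $$\frac{\varepsilon}{4} n \mu(S_\varepsilon) < n \mu(U).$$

If we put $\delta = \varepsilon^2 / 4$, then the inequality

    $$\mu(S_\varepsilon) < \varepsilon$$

follows from $\mu(U) < \delta$, which proves the lemma.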


Now, let $S_i \in G$ ($i = 1, 2, \ldots$), $\sum_i S_i = S$, and let the sets $S_i$ be
non-overlapping, so that the series

    $$\sum_{i=1}^{\infty} \mu(S_i) = \mu(S)$$

converges. We put $\sum_{i > N} S_i = U_N$, so that

    $$\mu(U_N) \to 0 \quad (N \to \infty).$$

Since $S = S_1 + \cdots + S_N + U_N$, then

    $$g_S(x) = g_{S_1}(x) + \cdots + g_{S_N}(x) + g_{U_N}(x),$$

and therefore

    $$h_S(x) = h_{S_1}(x) + \cdots + h_{S_N}(x) + h_{U_N}(x).$$

But we have $h_{S_i}(x) = \mu(S_i)$ ($1 \le i \le N$) with probability 1.
Therefore, with probability 1, we have

    $$h_S(x) = \mu\Big(\sum_{i=1}^{N} S_i\Big) + h_{U_N}(x).$$

But by Lemma 4.1, for sufficiently large $N$ and with probability
arbitrarily close to unity, we have

    $$h_{U_N}(x) < \varepsilon,$$

where $\varepsilon > 0$ is arbitrarily small. Since $h_S(x)$ is independent of
$N$ and $\varepsilon$, then with probability 1

    $$h_S(x) = \mu(S),$$

and therefore $S \in G$. Thus, $G$ is a Borel field, a fact which, as
we have seen, completes the proof that every source which reflects
all the cylinders of the space $A^I$ is ergodic.


§5. The E property. McMillan’s theorem.


If the alphabet $A$ of a given source contains $a$ letters, then
the number of different “n-term sequences”

    $$x_t, x_{t+1}, \ldots, x_{t+n-1}$$

which can be emitted by the source is $a^n$. As already stated
in §3, these sequences $C$ can be regarded as the elementary
events of a finite probability space. The probabilities $\mu(C)$ of
these elementary events are determined by specifying the source
itself, since every sequence $C$ is a cylinder of the space $A^I$ and
has the definite probability $\mu(C)$ (which in the case of a stationary
source is independent of $t$). Thus, the different sequences
$C$ can also be regarded as compound events (cylinders) of the
infinite space $A^I$. As long as $n$ remains constant, the first point
of view is preferable, since it is considerably simpler than the
second. But if $n$ varies in our considerations, then the first point
of view becomes inconvenient, since it forces us to consider a
separate space for every value of $n$; in such cases it is usually
more advantageous to use the second point of view, since it
allows us to consider all the events as occurring in one and the
same space $A^I$.

As we have already noted in §3, every (numerically valued)
function of the letters $x_t, x_{t+1}, \ldots, x_{t+n-1}$ can be regarded as a
random variable on the space $A^I$ of the given source, since the
letters $x_t, x_{t+1}, \ldots, x_{t+n-1}$, as well as the value of the given
function, are uniquely determined by specifying the elementary event
$x = (\ldots, x_{-1}, x_0, x_1, \ldots)$ of the space $A^I$. In particular,
$-(1/n) \lg \mu(C)$ is such a random variable, where $C$ is a random
sequence $x_t, x_{t+1}, \ldots, x_{t+n-1}$ with given $n$ and $t$. (Of course, $\lg \mu(C)$
is independent of $t$ if the given source is stationary.) Clearly,
this random variable assumes the same value for all $x$ having
the identical sequence $C = x_t, x_{t+1}, \ldots, x_{t+n-1}$ (i.e., belonging to
the cylinder corresponding to this sequence). Consequently, the
mathematical expectation of this quantity can easily be found by
elementary means, i.e., by multiplying its value on each cylinder
$C$ by the probability $\mu(C)$ of the cylinder and adding all the
products. This gives

    $$M\Big(-\frac{1}{n} \lg \mu(C)\Big) = -\frac{1}{n} \sum_C \mu(C) \lg \mu(C).$$

Here, we recognize the sum $-\sum_C \mu(C) \lg \mu(C)$ (see §3) as the
entropy of n-term sequences from the given source, which we
denoted by $H_n$. Thus we find

    $$M\Big[-\frac{1}{n} \lg \mu(C)\Big] = \frac{H_n}{n}.$$

Assuming the given source to be stationary, we set $t = 0$, so
that hereafter $C$ denotes the sequence $x_0, x_1, \ldots, x_{n-1}$. The
random variable $-(1/n) \lg \mu(C)$ is then a function of $x$ and $n$,
which we denote by $f_n(x)$; thus

    $$M f_n(x) = \frac{H_n}{n}.$$

In §3 we showed that for any stationary source the ratio $H_n/n$
approaches a definite limit as $n \to \infty$, which we agreed to call
the entropy (per letter) of the given source. Thus we find for
any stationary source

    $$M f_n(x) \to H \quad (n \to \infty),$$

i.e., the mathematical expectation of the random variable
$f_n(x) = -(1/n) \lg \mu(C)$ approaches the entropy of the given source as
$n \to \infty$.
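
The convergence of the expectation H_n / n to the entropy H can be checked by direct enumeration on a small example. The sketch below assumes a hypothetical two-letter stationary Markov source; the transition matrix `P`, the stationary distribution `PI`, and the function names are illustrative and do not come from the text. The closed form used in `entropy_rate` is the standard expression for the entropy of a stationary Markov chain, stated here as an assumption.

```python
import math
from itertools import product

# Hypothetical two-letter stationary Markov source (illustrative numbers).
P = [[0.9, 0.1],
     [0.4, 0.6]]      # P[i][j] = probability of letter j right after letter i
PI = [0.8, 0.2]       # stationary distribution, PI = PI * P

def mu(seq):
    """Probability of the cylinder x_0, ..., x_{n-1} = seq."""
    p = PI[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P[a][b]
    return p

def block_entropy(n):
    """H_n = -sum over all n-term sequences C of mu(C) lg mu(C)."""
    return -sum(mu(c) * math.log2(mu(c)) for c in product((0, 1), repeat=n))

def entropy_rate():
    """Entropy per letter of the Markov source (standard formula)."""
    return -sum(PI[i] * P[i][j] * math.log2(P[i][j])
                for i in range(2) for j in range(2))

for n in (1, 2, 4, 8, 12):
    print(n, block_entropy(n) / n, entropy_rate())
```

The printed ratios decrease toward the entropy per letter; for a Markov source the gap shrinks like 1/n.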

The fact of primary significance for all of information theory
is that a considerably stronger statement holds under certain
simplifying assumptions about the nature of the source: Not
only does the mathematical expectation of $f_n(x)$ approach $H$ as
a limit as $n \to \infty$, but $f_n(x)$ itself converges in probability to $H$
as $n \to \infty$. This means that, for arbitrarily small $\varepsilon > 0$ and
$\delta > 0$, the probability of the inequality $|f_n(x) - H| > \varepsilon$ is less than
$\delta$ for sufficiently large $n$. This can be said still more descriptively
as follows, by recalling the definition of $f_n(x)$. For
arbitrarily small $\varepsilon > 0$ and $\delta > 0$, and for sufficiently large $n$, all
the n-term sequences $C$ in the output of the given source can
be separated into two groups, such that

1) For every sequence $C$ of the first group

    $$\Big| \frac{\lg \mu(C)}{n} + H \Big| < \varepsilon.$$

2) The sum of the probabilities of all the sequences of the
second group is less than $\delta$.

Let us agree to call the first group the “high probability
group”, and the second group the “low probability group”.
The high probability group is characterized by the fact that
$(1/n) \lg \mu(C)$ is close to $-H$ for all its sequences $C$, so that the
probability $\mu(C)$ of such a sequence is approximately $2^{-nH}$.
Hence, all the sequences of the high probability group have
approximately the same probability $2^{-nH}$, which means that the
number of sequences in this group is approximately $2^{nH}$. If we
recall that the number of all n-term sequences is $a^n = 2^{n \lg a}$ and
that $H$ is always $\le \lg a$, then we see that, generally speaking
(more exactly, with the exception of the case $H = \lg a$), for
large $n$ the high probability group contains only a negligibly
small share of all the n-term sequences from the source. On
the contrary, the overwhelming majority of such sequences fall
into the low probability group.
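
The split into the two groups can be computed exactly for the simplest kind of source. The following is a minimal sketch, assuming a hypothetical source with independent letters 0 and 1 (letter 1 with probability p); for such a source the probability of an n-term sequence depends only on its number k of ones, so the groups can be tallied with binomial coefficients. The function name `high_probability_group` and the numbers used are illustrative.

```python
import math

def high_probability_group(n, p, eps):
    """Return (total probability, share of all 2^n sequences) of the group
    of n-term sequences C with | -(1/n) lg mu(C) - H | < eps.

    Assumes independent letters, letter 1 having probability p.
    """
    q = 1.0 - p
    H = -(p * math.log2(p) + q * math.log2(q))
    count, prob = 0, 0.0
    for k in range(n + 1):          # k = number of ones in the sequence
        f = -(k * math.log2(p) + (n - k) * math.log2(q)) / n
        if abs(f - H) < eps:
            count += math.comb(n, k)
            # Probability of "exactly k ones", computed through logarithms.
            log_prob = (math.lgamma(n + 1) - math.lgamma(k + 1)
                        - math.lgamma(n - k + 1)
                        + k * math.log(p) + (n - k) * math.log(q))
            prob += math.exp(log_prob)
    return prob, count / 2 ** n

prob, share = high_probability_group(1000, 0.1, 0.1)
print("probability of the group:", prob)
print("share of all sequences:  ", share)
```

With these illustrative values the group carries almost all of the probability while containing a vanishing share of the 2^n sequences, in line with the description above.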

We shall call the source property just described the E property.
As already mentioned, it is of fundamental significance
in information theory. Therefore it is important to find the
broadest possible class of sources possessing this property. For
sources such that each letter is statistically independent of all
the preceding letters, the E property is an almost immediate
consequence of the law of large numbers, and always holds;
however, in practice, we seldom encounter such sources. As
Shannon showed (see [1], also [5], Theorem 3), all sources of
the simple ergodic Markov chain type also possess the E property,
and this proof is easily carried over to the case of
compound Markov chains of any order. Finally, in 1953, McMillan
succeeded in proving that any ergodic source possesses the E
property. This important theorem permitted for the first time
the construction of a mathematical basis for the general theory
of information with sufficiently broad assumptions on the
statistical nature of the transmitted information. We shall give
a detailed proof of McMillan’s theorem in subsequent sections.
In doing so, we shall refer to certain “ergodic theorems” of
a general character, which, like Birkhoff’s theorem (already
cited more than once), we must assume to be known to the
reader.


§6. The martingale concept. Doob’s theorem.


In recent years, the concept of martingale, introduced by 
Doob, has been useful in various problems of probability theory. 
Here, we must become acquainted with this concept in a rather 
limited form corresponding to our needs. 

Let

(6.1)    $$\xi_1, \xi_2, \ldots, \xi_m, \ldots$$

be a sequence of random variables defined on the space of
elementary events $x \in A^I$. (The sequence of functions $f_n(x)$ which
we considered in the previous section is such a sequence.) In
general, the $\xi_i$ are mutually dependent, and, for $m > 1$, we can
speak of the conditional mathematical expectation of $\xi_m$ for
given values of $\xi_1, \ldots, \xi_{m-1}$. Let us agree to denote the
conditional mathematical expectation of $\xi_m$ for $\xi_1 = a_1$, $\xi_2 = a_2$, ...,
$\xi_{m-1} = a_{m-1}$ by $M_{a_1 a_2 \cdots a_{m-1}}(\xi_m)$.

The sequence (6.1) is called a martingale if for any $m > 1$
and any $a_i$

(6.2)    $$M_{a_1 a_2 \cdots a_{m-1}}(\xi_m) = a_{m-1}.$$

We call the martingale (6.1) bounded if the random variables
$\xi_m$ are all bounded ($|\xi_m| < C$ for any $x$ and $m$). In what follows,
we deal only with bounded martingales, so that the question
of the existence of the mathematical expectations which we
need causes no difficulty.
By the “sequence $a_m, \ldots, a_n$”, where $m \le n$, we mean the
set (cylinder) of those $x \in A^I$ for which $\xi_i = a_i$ ($m \le i \le n$). Let
$A_{m-1}$ be the sequence $a_1, \ldots, a_{m-1}$; then

    $$M_{a_1 \cdots a_{m-1}}(\xi_m) = \frac{1}{\mu(A_{m-1})} \int_{A_{m-1}} \xi_m(x) \, d\mu(x),$$

and the characteristic requirement (6.2), defining the martingale,
can be written as

    $$\int_{A_{m-1}} \xi_m(x) \, d\mu(x) = a_{m-1} \mu(A_{m-1}).$$

The fundamental theorem of Doob [7], which we need, is that
every bounded martingale converges with probability 1 (almost
everywhere in $A^I$). We need two auxiliary propositions in
order to prove it.


Lemma 6.1. 
Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be a bounded martingale, and let $A_j$ be
one of the sequences $a_1, a_2, \ldots, a_j$, where $m > j$. Then

(6.3)    $$\int_{A_j} \xi_m \, d\mu(x) = \int_{A_j} \xi_j \, d\mu(x) = a_j \mu(A_j).$$

(In other words, the conditional mathematical expectation of $\xi_m$
is $a_j$, for given $a_1, \ldots, a_j$.)
Proof.

For $m = j + 1$, (6.3) follows from the definition of a martingale.
Consequently, let $m > j + 1$. Then

    $$\int_{A_j} \xi_m \, d\mu(x) = \sum_s \int_{A_j C_s} \xi_m \, d\mu(x),$$

where the summation is over all sequences $C_s$ of the form
$a_{j+1}, \ldots, a_{m-1}$. For any sequence $C_s$, the intersection $A_j C_s$ is some
sequence $a_1, \ldots, a_{m-1}$, so that, by the martingale definition

    $$\int_{A_j C_s} \xi_m \, d\mu(x) = a_{m-1} \mu(A_j C_s) = \int_{A_j C_s} \xi_{m-1} \, d\mu(x)$$

(since $\xi_{m-1} = a_{m-1}$ for any $x \in A_j C_s$). Summing this equality over
all the sequences $C_s$, we find (for $m > j + 1$)

    $$\int_{A_j} \xi_m \, d\mu(x) = \int_{A_j} \xi_{m-1} \, d\mu(x).$$

Clearly, the successive application of this recurrence relation
leads to (6.3) and Lemma 6.1 is proved.


Lemma 6.2. 


Let $n > m > 0$, let $N$ be a set of sequences $a_1, \ldots, a_m$, and let
$A$ be the set of all sequences $a_m, \ldots, a_n$ for which

    $$\min_{m \le j \le n} a_j \le k,$$

where $k$ is any given real number. Then, for $r > n$

    $$\int_{AN} \xi_r(x) \, d\mu(x) \le k \mu(AN).$$

Proof.
We denote by $A_j$ ($m \le j \le n$) the set of all sequences
$a_m, \ldots, a_j$ for which

    $$a_m > k, \ldots, a_{j-1} > k, \quad a_j \le k.$$

Clearly, the sets $A_j$ ($m \le j \le n$) are non-overlapping and have
the set $A$ as their union. It is also clear that each $A_j$ is a
certain set of sequences $a_m, \ldots, a_j$ (i.e., whether $x \in A_j$ or not
is completely determined by the values of the quantities
$\xi_m(x) = a_m, \ldots, \xi_j(x) = a_j$). We have

(6.4)    $$\int_{AN} \xi_r \, d\mu(x) = \sum_{j=m}^{n} \int_{A_j N} \xi_r \, d\mu(x).$$

But each $A_j N$ is, evidently, a certain set of sequences $C_i^{(j)}$ of
the type $a_1, \ldots, a_j$, $A_j N = \sum_i C_i^{(j)}$, so that by Lemma 6.1 for
$r > n$

    $$\int_{A_j N} \xi_r \, d\mu(x) = \sum_i \int_{C_i^{(j)}} \xi_r \, d\mu(x) = \sum_i \int_{C_i^{(j)}} \xi_j \, d\mu(x) = \int_{A_j N} \xi_j \, d\mu(x).$$

Since $\xi_j = a_j \le k$ on the set $A_j$, then

    $$\int_{A_j N} \xi_j \, d\mu(x) \le k \mu(A_j N) \quad (m \le j \le n),$$

and, therefore, by (6.4)

    $$\int_{AN} \xi_r \, d\mu(x) \le k \mu(AN), \quad \text{Q.E.D.}$$


Remark.
It is clear that

    $$\int_{AN} \xi_r \, d\mu(x) \ge k \mu(AN)$$

can be proved in a completely analogous way if $A$ is the set
of all sequences $a_m, \ldots, a_n$ for which

    $$\max_{m \le j \le n} a_j \ge k.$$


Doob’s Theorem. 

Every bounded martingale (6.1) converges with probability
1 (i.e., almost everywhere in $A^I$).
Proof. 

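Suppose the theorem is incorrect. Then, real numbers $k_1$
and $k_2$ can be found such that

    $$\liminf_{n \to \infty} \xi_n < k_1 < k_2 < \limsup_{n \to \infty} \xi_n$$

on a set $A$ of positive probability. We put $\mu(A) = \eta > 0$. Clearly,
we can select a positive integer $n_1$ so large that

    $$\max_{1 \le j \le n_1} \xi_j \ge k_2$$

on a set $A_1$ for which

    $$\mu(A A_1) > \eta \Big(1 - \frac{1}{2^2}\Big).$$

Furthermore, if $n_2$ is large enough, we have

    $$\min_{n_1 \le j \le n_2} \xi_j \le k_1$$

on a set $A_2$ for which

    $$\mu(A A_2) > \eta \Big(1 - \frac{1}{2^3}\Big);$$

again, if $n_3$ is large enough

    $$\max_{n_2 \le j \le n_3} \xi_j \ge k_2$$

on a set $A_3$ for which

    $$\mu(A A_3) > \eta \Big(1 - \frac{1}{2^4}\Big);$$

etc. We continue this alternating process indefinitely. We put
$A_1 A_2 \cdots A_k = M_k$ ($k = 1, 2, \ldots$).
For any $k \ge 1$, we have

    $$\mu(M_k) \ge \mu(A M_k) = \mu(A) - \mu(A - A M_k) = \eta - \mu\Big\{ \bigcup_{i=1}^{k} (A - A A_i) \Big\}$$
    $$\ge \eta - \sum_{i=1}^{k} \mu(A - A A_i) = \eta - \sum_{i=1}^{k} [\mu(A) - \mu(A A_i)]$$
    $$\ge \eta - \sum_{i=1}^{k} \Big[ \eta - \eta \Big(1 - \frac{1}{2^{i+1}}\Big) \Big] = \eta - \eta \sum_{i=1}^{k} \frac{1}{2^{i+1}} > \frac{\eta}{2}.$$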

Furthermore, we have for any $r \ge 1$

    $$M_{2r} = A_{2r} M_{2r-1}.$$

It is clear that in this expression $M_{2r-1}$ is some set of sequences
$a_1, a_2, \ldots, a_{n_{2r-1}}$ and $A_{2r}$ is the set of sequences
$a_{n_{2r-1}}, \ldots, a_{n_{2r}}$ for which

    $$\min_{n_{2r-1} \le j \le n_{2r}} a_j \le k_1.$$

Thus, we see that if we put $m = n_{2r-1}$, $n = n_{2r}$, $N = M_{2r-1}$, $A = A_{2r}$,
$k = k_1$ in Lemma 6.2, then all the premises of this lemma are
fulfilled. Therefore, for any index $m > n_{2r}$ (this $m$ plays the role
of the index $r$ in Lemma 6.2), Lemma 6.2 gives

    $$\int_{M_{2r}} \xi_m(x) \, d\mu(x) \le k_1 \mu(M_{2r}).$$

In a completely analogous way (see the remark after the
proof of Lemma 6.2), we have

    $$\int_{M_{2r-1}} \xi_m(x) \, d\mu(x) \ge k_2 \mu(M_{2r-1}).$$

Therefore, putting $M_{2r-1} - M_{2r} = Q_r$, we find

    $$\int_{Q_r} \xi_m \, d\mu(x) = \int_{M_{2r-1}} \xi_m \, d\mu(x) - \int_{M_{2r}} \xi_m \, d\mu(x) \ge k_2 \mu(M_{2r-1}) - k_1 \mu(M_{2r})$$
    $$= (k_2 - k_1) \mu(M_{2r-1}) + k_1 [\mu(M_{2r-1}) - \mu(M_{2r})] > (k_2 - k_1) \frac{\eta}{2} + k_1 \mu(Q_r).$$

Since the sets $Q_r$ are non-overlapping, the series $\sum_r \mu(Q_r)$
converges, and $\mu(Q_r) \to 0$ as $r \to \infty$. But since the left side of the
last inequality does not exceed $C \mu(Q_r)$ in absolute value, because
$|\xi_m(x)| < C$, then both the left side and the second term of the right
side of this inequality are infinitesimally small as $r \to \infty$, whereas the
first term of the right side is a positive constant. Thus we
arrive at a contradiction which proves Doob’s theorem.
§7. Auxiliary propositions

We have already remarked repeatedly (§§ 3, 5) that every
quantity which can be uniquely determined by the sequence
$x_t, \ldots, x_{t+n-1}$ of letters from the alphabet $A$ can be regarded
as a function of the elementary event

    $$x = (\ldots, x_{-1}, x_0, x_1, \ldots)$$

of the space $A^I$ or, equivalently, as a random variable on this
space. In particular, in §5 we defined such a function, viz.
$f_n(x) = -(1/n) \lg \mu(C)$, where $C$ is the sequence $x_0, x_1, \ldots, x_{n-1}$.
Now we must become acquainted with some other functions
of the same kind. Let us agree to denote the random sequence
$x_{-n}, \ldots, x_{-1}$ by $C_n$, and the sequence $x_{-n}, \ldots, x_{-1}, x_0$ with one
extra letter by $C_n + x_0$. Each of these sequences is an event
(cylinder) of the space $A^I$, and every quantity which is uniquely
determined by the letters $x_{-n}, \ldots, x_{-1}, x_0$ is a random variable
on this space or, equivalently, a function of the elementary
event $x$. In particular, the ratio

    $$p_n(x) = \frac{\mu(C_n + x_0)}{\mu(C_n)}$$

is such a quantity, and can obviously be regarded as the conditional
probability of the appearance of the letter $x_0$ after the
appearance of the sequence $C_n = x_{-n}, \ldots, x_{-1}$. In addition, let
us put $p_0(x) = \mu(x_0)$.
Let $\alpha$ be any fixed letter of the alphabet $A$, and put

    $$p_n(x, \alpha) = \frac{\mu(C_n + \alpha)}{\mu(C_n)}.$$

This quantity (the probability of the appearance of the letter
$\alpha$ after the sequence $C_n$) differs from $p_n(x)$ in that in $p_n(x)$
the letter $x_0$ depends (just as do the letters $x_{-1}, \ldots, x_{-n}$) on $x$,
whereas the letter $\alpha$ which replaces it in $p_n(x, \alpha)$ is fixed and
independent of $x$.

Lemma 7.1. 

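The sequence $p_n(x, \alpha)$ ($n = 0, 1, \ldots$) is a martingale.
Proof.

For brevity, let us put $p_n(x, \alpha) = \xi_n$. Consider any fixed
sequence of $n - 1$ letters $a_{-1}, \ldots, a_{-(n-1)}$ and denote by $B_{n-1}$ the
cylinder $x_{-1} = a_{-1}, \ldots, x_{-(n-1)} = a_{-(n-1)}$ of the space $A^I$. Furthermore,
let $\Gamma_\beta$ denote the cylinder $x_{-n} = \beta$, where $\beta$ is any letter
of the alphabet $A$. Since $\sum_\beta \Gamma_\beta = A^I$, then

    $$\int_{B_{n-1}} \xi_n \, d\mu(x) = \sum_\beta \int_{B_{n-1} \Gamma_\beta} \xi_n \, d\mu(x);$$

but, for $x \in B_{n-1} \Gamma_\beta$, we have

    $$\xi_n = \frac{\mu(B_{n-1} \Gamma_\beta + \alpha)}{\mu(B_{n-1} \Gamma_\beta)}.$$

Therefore

    $$\int_{B_{n-1}} \xi_n \, d\mu(x) = \sum_\beta \mu(B_{n-1} \Gamma_\beta + \alpha) = \mu(B_{n-1} + \alpha).$$

But according to the definition of the random variable

    $$\xi_{n-1} = \frac{\mu(C_{n-1} + \alpha)}{\mu(C_{n-1})},$$

its value $[\xi_{n-1}]_{B_{n-1}}$ at $C_{n-1} = B_{n-1}$ (i.e., when the random cylinder
$C_{n-1}$ coincides with the cylinder $B_{n-1}$ which we chose) equals

    $$[\xi_{n-1}]_{B_{n-1}} = \frac{\mu(B_{n-1} + \alpha)}{\mu(B_{n-1})}.$$

Therefore

(7.1)    $$\int_{B_{n-1}} \xi_n \, d\mu(x) = [\xi_{n-1}]_{B_{n-1}} \mu(B_{n-1}).$$

Now, let $K_{n-1}$ be the set of all $x$ for which $\xi_1, \ldots, \xi_{n-1}$ take
any given system of values $\xi_i = \gamma_i$ ($1 \le i \le n - 1$). Since the
numbers $\xi_i$ ($1 \le i \le n - 1$) are uniquely determined by the
selection of the cylinder $B_{n-1}$, the set $K_{n-1}$ is the union of
several cylinders $B_{n-1}$, where, of course, $[\xi_i]_{B_{n-1}} = [\xi_i]_{K_{n-1}} = \gamma_i$
($1 \le i \le n - 1$) for all the cylinders $B_{n-1}$ in $K_{n-1}$. Thus it
follows from (7.1) that

    $$\int_{K_{n-1}} \xi_n \, d\mu(x) = \sum_{B_{n-1} \subset K_{n-1}} \int_{B_{n-1}} \xi_n \, d\mu(x) = \gamma_{n-1} \sum_{B_{n-1} \subset K_{n-1}} \mu(B_{n-1}) = \gamma_{n-1} \mu(K_{n-1}).$$

But this means that the sequence $\xi_n = p_n(x, \alpha)$ ($n \ge 0$) is a
martingale, which proves Lemma 7.1.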
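
The key relation (7.1) of this proof can be verified by enumeration on a small example. The sketch below assumes a hypothetical stationary two-letter source, namely an equal mixture of two sources with independent letters, chosen because its conditional probabilities genuinely change with the length of the past. The names `mu`, `p` and `check_71` and the parameters `p1`, `p2` are illustrative assumptions, not notation from the text.

```python
from itertools import product

def mu(seq, p1=0.2, p2=0.7):
    """Cylinder probability under an equal mixture of two coin sources
    (letter 1 has probability p1 in the first and p2 in the second)."""
    k, n = sum(seq), len(seq)
    return 0.5 * p1**k * (1 - p1)**(n - k) + 0.5 * p2**k * (1 - p2)**(n - k)

def p(past, alpha):
    """p_n(x, alpha) = mu(C_n + alpha) / mu(C_n), with past = (x_{-n}, ..., x_{-1})."""
    return mu(past + (alpha,)) / mu(past)

def check_71(n, alpha=1):
    """Largest violation of (7.1) over all cylinders B_{n-1}."""
    worst = 0.0
    for B in product((0, 1), repeat=n - 1):
        # Left side: integral of xi_n over B_{n-1}, split by the letter x_{-n} = beta.
        lhs = sum(mu((beta,) + B) * p((beta,) + B, alpha) for beta in (0, 1))
        # Right side: value of xi_{n-1} on B_{n-1} times mu(B_{n-1}).
        rhs = p(B, alpha) * mu(B)
        worst = max(worst, abs(lhs - rhs))
    return worst

print(max(check_71(n) for n in range(1, 9)))   # tiny, rounding error only
print(p((1,), 1), p((1, 1, 1), 1))             # the values do depend on n
```

The first printed number should be at the level of floating-point rounding, while the second line shows that the sequence is not constant in n for this source.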


Lemma 7.2. 


The sequence $p_n(x)$ ($n = 0, 1, \ldots$) converges almost everywhere.


Proof.

It is clear that for every fixed $x \in A^I$, the quantity $p_n(x)$
coincides with one of the $p_n(x, \alpha)$, where $\alpha$ runs through all
the letters of the alphabet $A$, and where $\alpha$ is the same for all
$n$ (but different for the different $x$). Consequently, for any $n$,
$m$ and for any $x \in A^I$

    $$|p_n(x) - p_m(x)| \le \sum_\alpha |p_n(x, \alpha) - p_m(x, \alpha)|.$$

But the martingale $p_n(x, \alpha)$, being obviously bounded, converges
almost everywhere, by Doob’s theorem (§6), no matter what
the letter $\alpha$ is, and, therefore, the right side of the last inequality
approaches zero when $n$ and $m$ increase without limit;
clearly this proves Lemma 7.2.

The definition which we gave of the function $p_n(x)$ assumes
that $\mu(C_n) > 0$; the $x$ for which $\mu(C_n) = 0$ are obviously a set of
probability 0, and we exclude them from consideration from
the outset by keeping only the $x$ for which $\mu(C_n) > 0$ for any
$n \ge 0$. Thus, the sequence $p_n(x)$ is defined almost everywhere
in $A^I$. But then the sequence of functions

    $$g_n(x) = -\lg p_n(x) \quad (n = 0, 1, \ldots)$$

will be defined almost everywhere, where the value $g_n(x) = +\infty$
[for $p_n(x) = 0$] is not excluded for these functions.
Lemma 7.3.
Let $E_{n,k}$ ($n \ge 0$, $k \ge 0$) denote the set of all $x$ for which

    $$k \le g_n(x) < k + 1;$$

then

(7.2)    $$\int_{E_{n,k}} g_n(x) \, d\mu(x) \le a (k + 1) 2^{-k},$$

where $a$ is the number of letters in the alphabet $A$.


Proof.

We define the cylinders $B_n$ as in the proof of Lemma 7.1,
and let $Z_\alpha$ denote the cylinder $x_0 = \alpha$, where $\alpha$ is any letter of
the alphabet $A$. For $x \in B_n Z_\alpha$, we have

    $$g_n(x) = -\lg \frac{\mu(B_n Z_\alpha)}{\mu(B_n)},$$

so that the value of $g_n(x)$ is uniquely determined by assigning
the cylinder $B_n$ and the letter $\alpha$. It is clear that

    $$B_n E_{n,k} = {\sum_\alpha}^{*} B_n Z_\alpha,$$

where ${\sum}^{*}$ denotes summation over those letters $\alpha$ for which
$k \le g_n(x) < k + 1$ for $x \in B_n Z_\alpha$. Therefore

(7.3)    $$\int_{B_n E_{n,k}} g_n(x) \, d\mu(x) = {\sum_\alpha}^{*} \int_{B_n Z_\alpha} g_n(x) \, d\mu(x).$$

But in any of the integrals of the right side we have

    $$k + 1 > g_n(x) = -\lg \frac{\mu(B_n Z_\alpha)}{\mu(B_n)} \ge k,$$

so that

    $$\mu(B_n Z_\alpha) \le 2^{-k} \mu(B_n),$$

and (7.3) gives

    $$\int_{B_n E_{n,k}} g_n(x) \, d\mu(x) \le (k + 1) {\sum_\alpha}^{*} \mu(B_n Z_\alpha) \le a (k + 1) 2^{-k} \mu(B_n).$$

Finally, summation over all possible cylinders $B_n$ gives

    $$\int_{E_{n,k}} g_n(x) \, d\mu(x) \le a (k + 1) 2^{-k},$$

and Lemma 7.3 is proved. As we shall see immediately, the
fact that this estimate is uniform with respect to $n$ is important.
In particular, the next proposition (uniform summability
of the functions $g_n(x)$) is an immediate consequence
of Lemma 7.3.
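
The estimate (7.2), and in particular its uniformity in n, can be checked by enumeration. The following sketch assumes a hypothetical two-letter source (so a = 2), again an equal mixture of two sources with independent letters; the names `mu` and `integral_over_E` and the parameters `p1`, `p2` are illustrative and not part of the text.

```python
import math
from itertools import product

def mu(seq, p1=0.02, p2=0.6):
    """Cylinder probability under an equal mixture of two coin sources."""
    k, n = sum(seq), len(seq)
    return 0.5 * p1**k * (1 - p1)**(n - k) + 0.5 * p2**k * (1 - p2)**(n - k)

def integral_over_E(n, k):
    """Integral of g_n over E_{n,k} = {x : k <= g_n(x) < k + 1}."""
    total = 0.0
    for B in product((0, 1), repeat=n):           # past letters x_{-n}, ..., x_{-1}
        for alpha in (0, 1):                      # current letter x_0
            g = -math.log2(mu(B + (alpha,)) / mu(B))
            if k <= g < k + 1:
                total += g * mu(B + (alpha,))
    return total

a = 2  # number of letters in the assumed alphabet
for k in range(6):
    worst = max(integral_over_E(n, k) for n in range(8))
    print(k, worst, a * (k + 1) * 2.0 ** (-k))
```

For every k the largest value found over the tested n stays below the bound a(k+1)2^{-k}, which does not depend on n.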
Lemma 7.4.

Given $L > 0$, let $A_{n,L}$ be the set of $x \in A^I$ for which $g_n(x) > L$.
Then, given any $\varepsilon > 0$, an $L_0 = L_0(\varepsilon)$ can be found such that

    $$\int_{A_{n,L}} g_n(x) \, d\mu(x) < \varepsilon,$$

for $L \ge L_0$, and for any $n \ge 1$.

Another important consequence of Lemma 7.3 is that the
absolute continuity of integrals of the functions $g_n(x)$ is also
uniform with respect to $n$. Thus we have
Lemma 7.5.

Given any $\varepsilon > 0$, a $\delta > 0$ can be found such that if $E \in F_A$
and $\mu(E) < \delta$, then

    $$\int_E g_n(x) \, d\mu(x) < \varepsilon \quad (n = 1, 2, \ldots).$$


Proof.
By Lemma 7.4, an $L = L(\varepsilon)$ can be found such that

    $$\int_{A_{n,L}} g_n(x) \, d\mu(x) < \varepsilon \quad (n = 1, 2, \ldots).$$

Put $\delta = \varepsilon / L$ and let $\mu(E) < \delta$. Then

    $$\int_E g_n(x) \, d\mu(x) = \int_{E A_{n,L}} g_n(x) \, d\mu(x) + \int_{E - A_{n,L}} g_n(x) \, d\mu(x)$$
    $$\le \int_{A_{n,L}} g_n(x) \, d\mu(x) + L \mu(E) < 2 \varepsilon,$$

and Lemma 7.5 is proved. A consequence is that almost everywhere
on $A^I$ we have $g_n(x) < +\infty$ for any $n \ge 1$.
Now let us put

    $$\lim_{n \to \infty} g_n(x) = g(x).$$

This limit exists almost everywhere, because of Lemma 7.2, if
we permit the value $+\infty$ for $g(x)$.


Lemma 7.6.

The function $g(x)$ is summable over $A^I$. (In particular,
this means that it can take the value $+\infty$ only on a set of
probability 0.)

Proof.
For any $L > 0$ and any positive function $f(x)$, we put

    $$f^L(x) = \min [L, f(x)].$$

Then obviously it follows from $g_n(x) \to g(x)$ that $g_n^L(x) \to g^L(x)$
($n \to \infty$). But since the functions $g_n^L(x)$ ($n = 1, 2, \ldots$) are uniformly
bounded, we find, using the known properties of the Lebesgue
integral and Lemma 7.3, that

    $$\int_{A^I} g^L(x) \, d\mu(x) = \lim_{n \to \infty} \int_{A^I} g_n^L(x) \, d\mu(x) \le \limsup_{n \to \infty} \int_{A^I} g_n(x) \, d\mu(x)$$
    $$= \limsup_{n \to \infty} \sum_{k=0}^{\infty} \int_{E_{n,k}} g_n(x) \, d\mu(x) \le a \sum_{k=0}^{\infty} (k + 1) 2^{-k}.$$

Since the right side is independent of $L$, Lemma 7.6 is
proved.


Lemma 7.7. 


    $$\int_{A^I} |g_n(x) - g(x)| \, d\mu(x) \to 0 \quad (n \to \infty).$$


Proof.
Let $\varepsilon > 0$ be arbitrary. Let us denote by $E_n$ the set of
$x \in A^I$ for which $|g_n(x) - g(x)| > \varepsilon$. Then

    $$\int_{A^I} |g_n(x) - g(x)| \, d\mu(x) = \int_{E_n} |g_n(x) - g(x)| \, d\mu(x) + \int_{A^I - E_n} |g_n(x) - g(x)| \, d\mu(x)$$
    $$\le \int_{E_n} g_n(x) \, d\mu(x) + \int_{E_n} g(x) \, d\mu(x) + \varepsilon.$$

Since $g_n(x) \to g(x)$ almost everywhere, $\mu(E_n) \to 0$ ($n \to \infty$);
consequently, the first integral on the right side is infinitesimally
small as $n \to \infty$ by Lemma 7.5, and the same is true of the second
integral by Lemma 7.6. This proves Lemma 7.7.


§8. Proof of McMillan’s theorem


McMillan’s theorem states (see §5) that as $n \to \infty$ the function
$f_n(x)$ converges in probability to the number $H$, the entropy
of the given source. Here

    $$f_n(x) = -\frac{1}{n} \lg \mu(C),$$

where $C$ is the random sequence (cylinder) $x_0, x_1, \ldots, x_{n-1}$ (which
is, of course, uniquely determined by giving $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$).
In order to use the results of §7, we must first of all relate
the functions $f_n(x)$ to the functions $g_n(x)$, which we studied in
§7. This relation is established by the following proposition.
Lemma 8.1.

For all $x \in A^I$ and $n \ge 1$

    $$f_n(x) = \frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k x).$$


Proof.

Since we shall now deal with sequences $x_r, \ldots, x_{r+s}$ for
various values of $r$ and $s$, we must introduce a more extensive
notation. We denote the probability of such a sequence by
$\mu[x_r, \ldots, x_{r+s}]$ so that, for example

    $$f_n(x) = -\frac{1}{n} \lg \mu[x_0, \ldots, x_{n-1}].$$

Similarly, the function $p_n(x)$, which we introduced in §7, can
be written as

    $$p_n(x) = \frac{\mu[x_{-n}, \ldots, x_{-1}, x_0]}{\mu[x_{-n}, \ldots, x_{-1}]}.$$

From this it is clear that for $k \ge 0$

    $$p_n(T^k x) = \frac{\mu[x_{k-n}, \ldots, x_k]}{\mu[x_{k-n}, \ldots, x_{k-1}]},$$

and, in particular

    $$p_k(T^k x) = \frac{\mu[x_0, \ldots, x_k]}{\mu[x_0, \ldots, x_{k-1}]}.$$

This equality holds for $k \ge 1$. Since $p_0(x) = \mu[x_0]$ by definition,
then

    $$p_0(T^0 x) = p_0(x) = \mu[x_0];$$

so that

    $$\prod_{k=0}^{n-1} p_k(T^k x) = \mu[x_0] \cdot \frac{\mu[x_0, x_1]}{\mu[x_0]} \cdots \frac{\mu[x_0, \ldots, x_{n-1}]}{\mu[x_0, \ldots, x_{n-2}]} = \mu[x_0, \ldots, x_{n-1}].$$

Taking the logarithm of this equality yields

    $$\sum_{k=0}^{n-1} g_k(T^k x) = -\lg \mu[x_0, \ldots, x_{n-1}] = n f_n(x),$$

and Lemma 8.1 is proved.
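
The telescoping product in this proof is easy to confirm numerically. The sketch below assumes a hypothetical two-letter stationary Markov source; the matrix `P`, the distribution `PI`, the sample window `x`, and the function names are illustrative choices. Only the letters x_0, ..., x_{n-1} of x are needed, so x is represented by a finite tuple.

```python
import math

# Hypothetical two-letter stationary Markov source (illustrative numbers).
P = [[0.9, 0.1],
     [0.4, 0.6]]
PI = [0.8, 0.2]       # stationary distribution, PI = PI * P

def mu(seq):
    """Probability mu[x_r, ..., x_{r+s}] of a block of consecutive letters."""
    if not seq:
        return 1.0
    p = PI[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P[a][b]
    return p

def g_shifted(x, k):
    """g_k(T^k x) = -lg( mu[x_0..x_k] / mu[x_0..x_{k-1}] )."""
    return -math.log2(mu(x[:k + 1]) / mu(x[:k]))

def f(x):
    """f_n(x) = -(1/n) lg mu[x_0..x_{n-1}], with n = len(x)."""
    return -math.log2(mu(x)) / len(x)

x = (0, 0, 1, 1, 0, 1, 0, 0)          # letters x_0, ..., x_7 of some sample
average = sum(g_shifted(x, k) for k in range(len(x))) / len(x)
print(f(x), average)                  # the two numbers agree
```

The agreement holds for any window x of positive probability, since the sum of the logarithms telescopes exactly as in the proof.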


McMillan’s Theorem. 

For any stationary source the sequence $f_n(x)$ converges in
the $L^1$-mean (and therefore also in probability) to some invariant
function $h(x)$. In the case of an ergodic source, $h(x)$ coincides
almost everywhere in $A^I$ with the entropy $H$ of the source.
Remark.

$f_n(x)$ is said to approach $h(x)$ in the $L^1$-mean if

    $$\int_{A^I} |f_n(x) - h(x)| \, d\mu(x) \to 0 \quad (n \to \infty);$$

it is well-known that this implies that $f_n(x)$ converges to $h(x)$ in
probability. As already noted, the function $h(x)$ is called invariant
if $h(Tx) = h(x)$ ($x \in A^I$).
Proof. 

One of the familiar forms of the ergodic theorem ([8], p.
31, Satz 10; [9], equation (2.42)) states that for any function
$g(x)$ which is summable over $A^I$, the quantity

    $$\frac{1}{n} \sum_{k=0}^{n-1} g(T^k x)$$




approaches an invariant function $h(x)$ in the $L^1$-mean as $n \to \infty$.
In particular, this holds for our function $g(x)$ because of Lemma
7.6. But by Lemma 8.1 we have

    $$\int_{A^I} |f_n(x) - h(x)| \, d\mu(x) = \int_{A^I} \Big| \frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k x) - h(x) \Big| \, d\mu(x)$$
    $$\le \int_{A^I} \Big| \frac{1}{n} \sum_{k=0}^{n-1} [g_k(T^k x) - g(T^k x)] \Big| \, d\mu(x) + \int_{A^I} \Big| \frac{1}{n} \sum_{k=0}^{n-1} g(T^k x) - h(x) \Big| \, d\mu(x)$$
    $$\le \frac{1}{n} \sum_{k=0}^{n-1} \int_{A^I} |g_k(x) - g(x)| \, d\mu(x) + \int_{A^I} \Big| \frac{1}{n} \sum_{k=0}^{n-1} g(T^k x) - h(x) \Big| \, d\mu(x),$$

where in transforming the first term we use the stationarity
of the source. ($d\mu(x)$ is replaced by $d\mu(T^k x)$, and then $T^k x$ is
introduced as a new variable of integration.)

Consider each of the two terms on the right side separately.
In the first term, the summands with increasing index approach
zero by Lemma 7.7. Therefore, their average value, i.e., the
whole first term on the right side, approaches zero as $n \to \infty$.
The same is true for the second term by the very definition
of the function $h(x)$. Thus we find

    $$\int_{A^I} |f_n(x) - h(x)| \, d\mu(x) \to 0 \quad (n \to \infty),$$

which proves the first part of McMillan’s theorem.

In the case of an ergodic source, the ergodic theorem cited
above states that the function $h(x)$ is almost everywhere a
constant $h$. Thus, to prove the second part of McMillan’s
theorem, we must show that $h = H$. But from

    $$\int_{A^I} |f_n(x) - h| \, d\mu(x) \to 0 \quad (n \to \infty),$$

it follows that

    $$\lim_{n \to \infty} \int_{A^I} f_n(x) \, d\mu(x) = \int_{A^I} h \, d\mu(x) = h.$$

The integral on the left side is the mathematical expectation
of $f_n(x)$ which, as we saw in §5, approaches $H$ as $n \to \infty$. This
means that $h = H$, which completes the proof of McMillan’s
theorem.

In particular, therefore, we have proved that every ergodic
source has the E property: For arbitrarily small $\varepsilon > 0$ and $\delta > 0$,
and sufficiently large $n$, all the $a^n$ n-term sequences of the
source output are divided into two groups, a high probability
group, such that $|(1/n) \lg \mu(C) + H| < \varepsilon$ for each of its sequences,
and a low probability group, such that the sum of the probabilities
of its sequences is less than $\delta$.


CHAPTER III. 


Channels and the sources driving them 


§9. Concept of channel. Noise. Stationarity. Anticipation and memory

From a physical point of view, by a source is meant an 
apparatus which emits signals. The medium over which the 
signal is transmitted, is called a channel. The concept of 
channel, as well as the concept of source, plays a fundamental 
role in information theory. Just as in Chapter II we started 
the theory of sources by giving their precise mathematical 
characterization, we must now do the same thing for channels. 
We saw that the elements which can be used to characterize
a source mathematically are its alphabet $A$, the probability
space $A^I$ with elementary events $x$, and the probability measure
$\mu(S)$ defined on the sets $S \in F_A$. What mathematical elements
can be used to characterize a given channel?

First of all, we must have a list of the signals which the 
channel in question can transmit. We shall always assume 
that this list is finite, and we shall call it the input alphabet 
(or the alphabet at the input) of the channel. The signals in 
this alphabet are called its letters. In general, the signals that 
emerge from a channel have an entirely different nature from 
the transmitted signals. Therefore, in addition to the input 
alphabet, we must know the output alphabet (or alphabet at 
the output) of the given channel, i.e., the list of signals (letters) 
which the channel can emit. We shall also always assume that 
the output alphabet is finite. 
If every transmitted signal $a$ (a letter of the input alphabet
$A$) gives at the output a unique letter $b = b(a)$ of the output
alphabet $B$, the channel is called a noiseless channel. In general,
the presence of interference (noise) causes different letters
$b \in B$ to be obtained at the output in different cases when the
same letter $a \in A$ is transmitted. Since it is natural to regard
the interference (noise) as a random phenomenon, and it governs
which letter $b \in B$ appears at the channel output when a given
signal $a \in A$ is transmitted, we can speak of the probability
of obtaining the letter $b \in B$ at the channel output, given that
the letter $a \in A$ was transmitted. In many cases, this probability
depends not only on $a$ but also on the sequence of signals
transmitted earlier, and this dependence must often be taken
into account. Therefore, at the outset it is expedient to give
the most general form to the phenomenon described. As in
Chapter II, we shall consider the set $A^I$ of all sequences

    $$x = (\ldots, x_{-1}, x_0, x_1, \ldots) \quad (x_i \in A; \ -\infty < i < +\infty).$$

To every such sequence of transmitted signals corresponds a
sequence

    $$y = (\ldots, y_{-1}, y_0, y_1, \ldots) \quad (y_i \in B; \ -\infty < i < +\infty)$$

of received (output) signals. The probability that $y_n$ will be a
given letter $b \in B$ must be regarded, in the general case, as
depending on the whole set of signals $x_i$ sent, i.e., as a function
of $x \in A^I$. We can denote this probability by $\nu_x(y_n = b)$. We
can also say that $\nu_x(y_n = b)$ is the conditional probability (for a
given sequence $x \in A^I$ of transmitted signals) that the sequence
$y \in B^I$ of received signals will belong to the cylinder $y_n = b$. It
is clear that in order to characterize the channel completely
we must know such conditional probabilities not only for the
simplest cylinders of the type $y_n = b$, but also for sets $S \subset B^I$
of more complex structure. It is natural to require that such
probabilities $\nu_x(S)$ be assigned for all cylinders $S \subset B^I$; in this
way, $\nu_x(S)$ will be automatically defined for all sets $S$ in the
Borel field $F_B$ generated by the set of these cylinders.

Thus, we shall consider a channel to be specified if we know
the following three elements: 1) the input alphabet $A$, 2) the
output alphabet $B$, 3) the probability $\nu_x(S)$ that the $y$ received
when a given $x$ is transmitted belongs to the set $S \in F_B$. (This
probability must be given for any $x \in A^I$ and any $S \in F_B$.) We
shall denote the channel specified by these elements by $[A, \nu_x, B]$.
Evidently, $\nu_x(S)$ must be understood to be a one-parameter
(with parameter $x$) family of probability measures defined on
the space of elementary events $B^I$. We shall call the channel
$[A, \nu_x, B]$ stationary if, for all $x \in A^I$ and $S \in F_B$

    $$\nu_{Tx}(TS) = \nu_x(S),$$

where $T$ is the same shift operator with which we were concerned
in the preceding chapter.

In the majority of applications, the probability $\nu_x(y_n = b)$ that
$y_n$ coincides with a given letter $b \in B$ does not depend on all
the letters $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$ of the transmitted message, but
only on those with indices rather close to $n$. First of all, we
shall always assume that the distribution of $y_n$ is independent
of the signals that are transmitted after $x_n$, i.e.,
is independent of $x_k$ for $k > n$; this means that $\nu_x(y_n = b)$ has
the same value for all transmitted messages $x$ for which the
signals $\ldots, x_{n-1}, x_n$ are identical. In this case, we speak of a
channel without anticipation. As regards the signals $x_{n-1}, x_{n-2}, \ldots$
preceding $x_n$, in the majority of cases only a limited number
of them (e.g., $x_{n-1}, x_{n-2}, \ldots, x_{n-m}$) can influence the distribution
of $y_n$. This means that the probability $\nu_x(y_n = b)$ is the same
for all $x$ with identical $x_{n-m}, \ldots, x_{n-1}, x_n$. In this case, we
speak of a channel with a finite memory. The smallest number
$m$ for which the above holds is called the memory of the
channel; in particular, the distribution of $y_n$ for a channel
without memory ($m = 0$) depends only on $x_n$.
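
As an illustration only, the following Python sketch simulates a channel without anticipation and with finite memory $m$: the law of $y_n$ is looked up from the window $x_{n-m}, \ldots, x_n$ alone. The names `transmit`, `kernel`, `example_kernel` and `noiseless_kernel`, as well as the noise levels, are assumptions made for this sketch and do not come from the text.

```python
import random

def transmit(x, kernel, m, rng):
    """Simulate a channel without anticipation and with memory m.

    x      : list of input letters x_0, ..., x_{L-1}
    kernel : hypothetical function mapping the window
             (x_{n-m}, ..., x_n) to a dict {output letter: probability}
    Returns y_m, ..., y_{L-1}; earlier outputs would need inputs before x_0.
    """
    y = []
    for n in range(m, len(x)):
        window = tuple(x[n - m:n + 1])      # only these inputs influence y_n
        law = kernel(window)
        letters = list(law)
        weights = [law[b] for b in letters]
        y.append(rng.choices(letters, weights=weights)[0])
    return y

def example_kernel(window):
    """Binary channel with memory 1 (illustrative numbers only):
    the flip probability is larger right after a change of input."""
    prev, cur = window
    flip = 0.3 if prev != cur else 0.1
    return {cur: 1.0 - flip, 1 - cur: flip}

def noiseless_kernel(window):
    """Channel without memory (m = 0) and without noise: y_n = x_n."""
    return {window[-1]: 1.0}

rng = random.Random(0)
message = [0, 0, 1, 1, 0, 1, 0, 0]
received = transmit(message, example_kernel, m=1, rng=rng)
```

Because the kernel sees only the window ending at $x_n$, later inputs cannot influence $y_n$ (no anticipation), and inputs older than $x_{n-m}$ cannot either (memory $m$).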


§10. Connection of the channel to the source


Suppose we have a channel $[A, \nu_x, B]$ and a source $[A, \mu]$
for which the alphabet $A$ coincides with the input alphabet of
the channel. Then, clearly, the channel $[A, \nu_x, B]$ can be used
directly to transmit the output of the source $[A, \mu]$. Each
symbol $x_n$ emitted by the source is one of the letters of the
alphabet $A$, and therefore can act as an input to the channel
$[A, \nu_x, B]$. Then at the channel output we obtain a letter $y_n$
of the alphabet $B$. When the letters from some message

    $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$

from the given source are fed into the channel one at a time,
we obtain at the output the corresponding sequence

    $y = (\ldots, y_{-1}, y_0, y_1, \ldots)$

of letters from the alphabet $B$. In the general case, the
distribution law of every letter $y_n$ of this sequence depends on
the entire sequence $x$, and is determined by the probability
measure $\nu_x$. For a channel without anticipation, this
distribution law depends only on the symbols $\ldots, x_{n-1}, x_n$; if, in
addition, the channel has finite memory $m$, then the distribution
of $y_n$ depends only on $x_{n-m}, \ldots, x_n$.

This whole situation is described by saying that the source
$[A, \mu]$ feeds (drives) the channel $[A, \nu_x, B]$. From the
probabilistic point of view, such a connection of the channel and
the source is a phenomenon in which chance intervenes in two
ways: 1) the choice of the message $x \in A^I$ is random (this
randomness is governed by the distribution $\mu$), and 2) for a
given $x \in A^I$, the $y \in B^I$ which is received at the channel output
is random, because of the presence of noise. (This randomness
is governed by the distribution $\nu_x$.)

Consider now the probability space in which the elementary
events are all possible pairs $(x, y)$, $x \in A^I$, $y \in B^I$. Let $C$ be the
set of all pairs $(a, b)$, where $a \in A$, $b \in B$. We can regard $C$ as
a new alphabet; then, it is natural to denote by $C^I$ the set of
pairs $(x, y)$ of which we just spoke. Thus the specification of
$(x, y) \in C^I$ is equivalent to the specification of $x \in A^I$, $y \in B^I$.

We must now introduce probabilities into the space $C^I$. Let
$M \subset A^I$, $N \subset B^I$, i.e., let $M$ be some set of elements $x$ and $N$
some set of elements $y$; indeed, let $M \in F_A$ (so that $\mu(M)$ has
a definite value) and let $N \in F_B$. Then it is clear that the
direct product $S = M \times N$ of the sets $M$ and $N$ is a set of pairs
$(x, y)$, so that $S \subset C^I$. The probability $\omega(S)$ of this set of the
space $C^I$ should naturally be understood to be the probability
of the joint event $x \in M$, $y \in N$. But the distribution in the
space of elementary events $x \in A^I$ is determined by the $\mu$ law,
and, for a given $x$, the distribution in the space of elementary
events $y \in B^I$ is determined by the $\nu_x$ law. Therefore

(10.1)    $\omega(S) = \omega(M \times N) = \int_M \nu_x(N)\, d\mu(x).$

In particular, every cylinder $Z \subset C^I$ is obviously the direct product
of a cylinder $Z_1 \subset A^I$ and a cylinder $Z_2 \subset B^I$. Therefore, Eq.
(10.1) tells us the probability of any such cylinder $Z$. It follows
by the general property of Borel sets that the probability $\omega(S)$
is uniquely defined for any set $S \in F_C$.
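
To make (10.1) concrete, here is a minimal Python sketch for the simplest special case, which is an assumption of the sketch and not the general setting of the text: single letters instead of infinite sequences, and a channel without memory, so that the integral reduces to a finite sum. The function names `compound` and `omega_rectangle` and the numerical values are hypothetical.

```python
def compound(mu, nu):
    """Joint law omega(a, b) = mu(a) * nu_a(b) on the product alphabet C = A x B."""
    return {(a, b): mu[a] * nu[a][b] for a in mu for b in nu[a]}

def omega_rectangle(mu, nu, M, N):
    """Discrete analogue of (10.1): sum over a in M of nu_a(N) * mu(a)."""
    return sum(mu[a] * sum(nu[a][b] for b in N) for a in M)

# Illustrative data: a uniform binary source and a symmetric binary channel
# that flips each letter with probability 0.1.
mu = {0: 0.5, 1: 0.5}
nu = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
omega = compound(mu, nu)
```

Taking `M` to be the whole input alphabet in `omega_rectangle` gives the probability of the output set `N` alone, which is the discrete counterpart of the output distribution discussed next in the text.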

Thus we see that connecting the channel $[A, \nu_x, B]$ to the
source $[A, \mu]$ driving it uniquely determines a new source
$[C, \omega]$. The alphabet of this source is the direct product $A \times B$;
the set $C^I$ of elementary events $(x, y)$ is the direct product
$A^I \times B^I$, and the probability measure $\omega(S)$ is given by (10.1) in
the way described above. In what follows, we shall call this
new source $[C, \omega]$ the compound source constructed by connecting
the channel $[A, \nu_x, B]$ to the source $[A, \mu]$. Since we shall
deal only with stationary sources and channels, the following
almost obvious theorem will be of importance to us.

Theorem.

If the source $[A, \mu]$ and the channel $[A, \nu_x, B]$ are stationary,
then the source $[C, \omega]$ is also stationary.

Proof.

Suppose $S \subset C^I$ and $S = X \times Y$, $X \in F_A$, $Y \in F_B$. Then, first
of all, it is obvious that $TS = TX \times TY$. Therefore, Eq. (10.1)
gives

    $\omega(TS) = \int_{x \in TX} \nu_x(TY)\, d\mu(x)$

or, after substituting a new variable of integration,

    $\omega(TS) = \int_{Tz \in TX} \nu_{Tz}(TY)\, d\mu(Tz).$

Since the source $[A, \mu]$ is stationary, $d\mu(Tz) = d\mu(z)$, and since
the channel $[A, \nu_x, B]$ is stationary, $\nu_{Tz}(TY) = \nu_z(Y)$; therefore

    $\omega(TS) = \int_{z \in X} \nu_z(Y)\, d\mu(z) = \omega(S).$

We have established this equality for all $S = X \times Y$, in particular,
for all cylinders $S \subset C^I$; therefore it is also valid for all
$S \in F_C$. This proves the theorem.

Let us put $M = A^I$ in Eq. (10.1), while leaving $N \in F_B$ arbitrary.
The quantity $\omega(M \times N)$ is then the probability of the joint
event $x \in A^I$, $y \in N$, and since the first of these two events is
sure, $\omega(M \times N)$ is simply the probability $\eta(N)$ of obtaining a
sequence $y$ belonging to the set $N \in F_B$ at the channel output.
Thus we see that the distribution $\eta(N)$ plays the same role for
the space $B^I$ as $\mu(M)$ does for the space $A^I$. For $M = A^I$,
equation (10.1) becomes

(10.2)    $\eta(N) = \omega(A^I \times N) = \int_{A^I} \nu_x(N)\, d\mu(x).$

Thus we can speak of the source $[B, \eta]$ at the channel output.
This source, with the sequence

    $y = (\ldots, y_{-1}, y_0, y_1, \ldots)$

of letters from the alphabet $B$ as its output, is uniquely
determined (by using (10.2)) by the data of our problem, i.e., by the
source $[A, \mu]$ and the channel $[A, \nu_x, B]$. (Of course, the source
$[C, \omega]$ considered above is uniquely determined by the same
data.) If the source $[A, \mu]$ and the channel $[A, \nu_x, B]$ are
stationary, then, as we proved, the source $[C, \omega]$ is also
stationary; but then

    $\eta(TN) = \omega(A^I \times TN) = \omega(TA^I \times TN) = \omega(A^I \times N) = \eta(N),$

and therefore the source $[B, \eta]$ is also stationary.

We proved in §3 that every stationary source has a definite
entropy. Therefore, if the source $[A, \mu]$ and the channel
$[A, \nu_x, B]$ are stationary, each of the three sources $[A, \mu]$, $[B, \eta]$
and $[C, \omega]$ has a definite entropy. Let us agree to denote these
three entropies by $H(X)$, $H(Y)$, and $H(X, Y)$, respectively.
We defined the entropy $H(X)$ as follows. Let $H_n(X)$ be the
entropy of the finite space the elementary events of which are
all the $n$-term sequences $x_0, \ldots, x_{n-1}$ (cylinders) emitted by the
source $[A, \mu]$, with corresponding probabilities determined by
the distribution $\mu$; then

    $H(X) = \lim_{n \to \infty} \frac{1}{n} H_n(X).$

In complete analogy

    $H(Y) = \lim_{n \to \infty} \frac{1}{n} H_n(Y),$

    $H(X, Y) = \lim_{n \to \infty} \frac{1}{n} H_n(X, Y),$

where correspondingly $H_n(Y)$ and $H_n(X, Y)$ denote the entropies
of the finite spaces of sequences $y_0, \ldots, y_{n-1}$ from the source
$[B, \eta]$ and $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ from the source $[C, \omega]$.

But giving the “pair sequence” $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ is
obviously equivalent to giving the pair of sequences $x_0, \ldots, x_{n-1}$ and
$y_0, \ldots, y_{n-1}$, so that the space of sequences $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$
is the product of the spaces of sequences $x_0, \ldots, x_{n-1}$ and
$y_0, \ldots, y_{n-1}$. Therefore, as we saw in §1,

    $H_n(X, Y) = H_n(X) + H_{nX}(Y) = H_n(Y) + H_{nY}(X),$

where $H_{nX}(Y)$ denotes the average conditional entropy of the
space of sequences $y_0, \ldots, y_{n-1}$ for a given sequence $x_0, \ldots, x_{n-1}$,
and $H_{nY}(X)$ has an analogous meaning. It follows that

    $H_{nX}(Y) = H_n(X, Y) - H_n(X),$
    $H_{nY}(X) = H_n(X, Y) - H_n(Y).$
Dividing all the terms of these equalities by $n$, and passing
to the limit as $n \to \infty$, we find

(10.3)    $\lim_{n \to \infty} \frac{1}{n} H_{nX}(Y) = H(X, Y) - H(X),$

          $\lim_{n \to \infty} \frac{1}{n} H_{nY}(X) = H(X, Y) - H(Y).$

This means, first of all, that the limits on the left always exist.
Let us denote them by $H_X(Y)$ and $H_Y(X)$, respectively. Consider
the conditional entropy per symbol of the output of the source
$[B, \eta]$ for a given output $x$ of the source $[A, \mu]$; then $H_X(Y)$
can be interpreted as the average of this entropy over $x$. $H_Y(X)$
has a similar meaning with the roles of the sources $[A, \mu]$ and
$[B, \eta]$ interchanged.

In practice, this latter quantity is of the greatest interest.
Let us recall that, on the one hand, we regarded the entropy
of any finite space as a measure of the uncertainty contained
in the space, and, on the other hand, as a measure of the
amount of information given by “removing” this uncertainty,
i.e., by answering the question of which event of the given
space actually occurred. Thus $H_n(X)$ is a measure of the
uncertainty in the sequence $x_0, \ldots, x_{n-1}$. The quantity $H_{nY}(X)$ is
a measure of the average amount of uncertainty in the same
sequence $x_0, \ldots, x_{n-1}$, given that this sequence entered the
channel and that the sequence $y_0, \ldots, y_{n-1}$ was obtained at the
output. Thus $H_{nY}(X)$ indicates how much uncertainty still
remains in the sequence $x_0, \ldots, x_{n-1}$ after it has been transmitted
through the channel. Of course, we always have $H_{nY}(X) = 0$
for a noiseless channel, for then knowledge of the received
sequence $y_0, \ldots, y_{n-1}$ tells us with certainty the transmitted
sequence $x_0, \ldots, x_{n-1}$. In the general case, $H_{nY}(X) > 0$ and
represents the “residual entropy” of the sequence $x_0, \ldots, x_{n-1}$,
which it retains after passing through the channel. After
dividing $H_n(X)$ and $H_{nY}(X)$ by $n$ and passing to the limit as
$n \to \infty$, we obtain $H(X)$, the average uncertainty per signal of
the source $[A, \mu]$, and $H_Y(X)$, the average uncertainty per
signal of the source after the message has passed through the
channel.

On the other hand, $H_n(X)$ is the amount of information
contained in the sequence $x_0, \ldots, x_{n-1}$, and $H_{nY}(X)$ is the amount
of information it retains after being transmitted through the
channel. The difference $H_n(X) - H_{nY}(X)$ is thus a measure of
the average amount of information given by a sequence
$x_0, \ldots, x_{n-1}$ transmitted through the channel. After dividing by
$n$ and passing to the limit, we find that the difference
$H(X) - H_Y(X)$ is the amount of information obtained on the average
when a signal of the source $[A, \mu]$ is transmitted through the
channel. This quantity

    $R(X, Y) = H(X) - H_Y(X)$

is therefore the most important characteristic of the quality of
transmission. It is natural to call it the rate of transmission
of information from the given source $[A, \mu]$ through the
channel $[A, \nu_x, B]$. We have by (10.3)

    $H_Y(X) = H(X, Y) - H(Y),$

which means that

    $R(X, Y) = H(X) + H(Y) - H(X, Y).$

Thus the rate of transmission $R(X, Y)$ is uniquely determined by
giving the entropy of the three sources $[A, \mu]$, $[B, \eta]$, and $[C, \omega]$.
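
The formula $R(X, Y) = H(X) + H(Y) - H(X, Y)$ can be evaluated directly in the simplest case. The Python sketch below assumes a source emitting independent letters and a channel without memory, so that the per-letter entropies already equal the limits $H(X)$, $H(Y)$, $H(X, Y)$; this restriction, the function names `entropy`, `rate`, `bsc`, and the numerical values are assumptions of the sketch, not part of the text.

```python
from math import log2

def entropy(law):
    """Entropy in bits of a finite distribution given as {event: probability}."""
    return -sum(p * log2(p) for p in law.values() if p > 0)

def rate(mu, nu):
    """R(X, Y) = H(X) + H(Y) - H(X, Y) for independent letters, memoryless channel."""
    omega = {(a, b): mu[a] * nu[a][b] for a in mu for b in nu[a]}   # compound source
    eta = {}
    for (a, b), p in omega.items():                                 # output source
        eta[b] = eta.get(b, 0.0) + p
    return entropy(mu) + entropy(eta) - entropy(omega)

def bsc(flip):
    """Hypothetical binary channel that flips each letter with probability `flip`."""
    return {0: {0: 1 - flip, 1: flip}, 1: {0: flip, 1: 1 - flip}}

uniform = {0: 0.5, 1: 0.5}
r_noisy = rate(uniform, bsc(0.1))      # about 0.531 bits per letter
r_clean = rate(uniform, bsc(0.0))      # noiseless: all of H(X) = 1 bit gets through
r_useless = rate(uniform, bsc(0.5))    # output independent of input: nothing gets through
```

The two extreme cases agree with the discussion above: for a noiseless channel the residual entropy vanishes and $R = H(X)$, while for a channel whose output is independent of its input $R = 0$.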

§11. The ergodic case

Retaining the terminology and notation of the preceding
sections, we now assume that the source $[A, \mu]$ is ergodic and
that the channel $[A, \nu_x, B]$ has finite memory $m$. We shall now
show that in this case the compound source $[C, \omega]$ and the
source $[B, \eta]$ at the channel output are also ergodic.
Let $Z$ be any cylinder of the space $C^I$, obtained by fixing
any sequence of pairs $(x_0, y_0), \ldots, (x_{j-1}, y_{j-1})$ or, equivalently,
the pair of sequences

    $u = x_0, \ldots, x_{j-1}; \qquad v = y_0, \ldots, y_{j-1}.$

The sequences $u$ and $v$ (equivalently, the cylinder $Z$) will be
regarded as constant throughout the proof. Now consider any
fixed $m$-term sequence $z = x_{-m}, \ldots, x_{-1}$, and denote by $w$ the
sequence $x_{-m}, \ldots, x_{-1}, x_0, \ldots, x_{j-1}$ of constant length $m + j = t$. Since
the source $[A, \mu]$ is ergodic, we have almost everywhere in $A^I$

(11.1)    $\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g_w(T^k x) = \mu(w).$

Furthermore, by Birkhoff’s theorem, the limit

(11.2)    $\lim_{s \to \infty} \frac{1}{s} \sum_{l=0}^{s-1} g_w(T^{tl+r} x) = \psi_{r,z}(x)$

exists almost everywhere in $A^I$ for $0 \le r \le t-1$, and for any
sequence $z$. Let us denote by $M \subset A^I$ the set of $x$ for which
(11.1) obtains and for which the limits (11.2) exist for any $r$
($0 \le r \le t-1$) and $z$. Obviously $\mu(M) = 1$.

The sum on the left side of (11.2)

(11.3)    $\sum_{l=0}^{s-1} g_w(T^{tl+r} x)$

is clearly the number of values of $l$ ($0 \le l \le s-1$) for which
$T^{tl+r} x \in w$, i.e., for which the sequence of length $t$

    $c_l = x_{tl+r-m}, \ldots, x_{tl+r+j-1}$

coincides with the sequence $w$. (We note that the sequences $c_l$
fill the sequence $x_{r-m}, \ldots, x_{r-m+st-1}$ without gaps and without
overlapping, as $l$ runs through the values $0, 1, \ldots, s-1$.) Let
this number be denoted by

    $p = p(s, r, z, x),$

where, by (11.2), for any $x \in M$

(11.4)    $p = s\,\psi_{r,z}(x) + o(s)$

as $s \to \infty$, and let $l_1, l_2, \ldots, l_p$, in order of increasing size, denote
the values of $l$ ($0 \le l \le s-1$) for which $c_l$ coincides with $w$.
Furthermore, let $\gamma_l$ ($0 \le l \le s-1$) denote the (random) sequence

    $y_{tl+r}, \ldots, y_{tl+r+j-1}$

of length $j$.

The sequences $\gamma_{l_k}$ ($1 \le k \le p$) are of special interest to us.
Let $A_k$ denote the event consisting of the sequence $\gamma_{l_k}$ coinciding
with the sequence $v$ (which enters into the definition of
the cylinder $Z$); obviously $A_k$ is the event (cylinder) $T^{-(tl_k+r)} v$ of
the space $B^I$. Its probability for a given $x$ is $\nu_x(A_k) = \nu_x(T^{-(tl_k+r)} v)$.
But our channel is without anticipation and has finite memory
$m$. Therefore the event $A_k$ determined by the sequence $\gamma_{l_k}$
depends statistically only on

    $x_{tl_k+r-m}, \ldots, x_{tl_k+r+j-1},$

i.e., on the sequence $c_{l_k}$ to which $x$ belongs. But, by the
definition of the number $l_k$, we have, for $k = 1, 2, \ldots, p$

    $c_{l_k} = T^{-(tl_k+r)} w.$

Thus we arrive at the conclusion that the probability $\nu_x(T^{-(tl_k+r)} v)$
has the same value for all $x \in T^{-(tl_k+r)} w$. Since our channel is
stationary, this probability is the same as the probability $\nu_x(v)$
of the cylinder $v$ for any $x \in w$. This latter probability, which
it is convenient to denote by $\nu_w(v)$, is independent of either $k$
or $r$.

It follows that $\nu_x(A_k) = \nu_w(v)$ for all $k$ ($1 \le k \le p$). Moreover,
it is an immediate consequence of the fact that the channel
memory is $m$ that the events $A_k$ are independent. We have
$p \to \infty$ as $s \to \infty$ (if $x \in M$), so that the events $A_1, \ldots, A_p$ form
a classical Bernoulli scheme. All these events are defined in
the space $B^I$. If $q$ denotes the number of values of $k$ ($1 \le k \le p$)
for which $A_k$ occurs (for a given $x \in M$), then, as $s \to \infty$ (which,
for $x \in M$, implies $p \to \infty$), we have by the strong law of large
numbers

(11.5)    $\frac{q}{p} \to \nu_w(v)$

with probability 1. This means that if $x \in M$ and if $N$ denotes
the set of values of $y \in B^I$ for which (11.5) holds as $s \to \infty$, then
$\nu_x(N) = 1$. The set $N$ depends on $r$, $z$, and $x$, i.e., $N = N(r, z, x)$,
but $r$ and $z$ can assume only a finite number of different values.
If we put

    $\prod_{r,z} N(r, z, x) = N_x^*,$

then obviously for $x \in M$

    $\nu_x(N_x^*) = 1,$

and (11.5) holds for any $r$ and $z$, if $x \in M$, $y \in N_x^*$. But since
by (11.4)

    $\frac{p}{s} \to \psi_{r,z}(x) \qquad (s \to \infty)$

for any $x \in M$, then, for $x \in M$, $y \in N_x^*$, we obtain

(11.6)    $\frac{q}{s} \to \nu_w(v)\, \psi_{r,z}(x) \qquad (s \to \infty)$

for any $r$ and $z$.


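By the definition of the number $q$, we have immediately

    $q = \sum_{l=0}^{s-1} g_w(T^{tl+r} x)\, g_v(T^{tl+r} y).$

Summing this expression over $r$ from 0 to $t-1$, we find, for
$x \in M$, $y \in N_x^*$, using (11.6)

    $\sum_{k=0}^{st-1} g_w(T^k x)\, g_v(T^k y) = s\, \nu_w(v) \sum_{r=0}^{t-1} \psi_{r,z}(x) + o(s).$

But by the definition of the function $\psi_{r,z}(x)$ and by the
ergodicity of the source $[A, \mu]$, we have, for $x \in M$

    $\sum_{r=0}^{t-1} \psi_{r,z}(x) = \lim_{s \to \infty} \frac{1}{s} \sum_{k=0}^{st-1} g_w(T^k x) = t\,\mu(w),$

so that

(11.7)    $\sum_{k=0}^{st-1} g_w(T^k x)\, g_v(T^k y) = st\, \mu(w)\, \nu_w(v) + o(s)$

for $x \in M$, $y \in N_x^*$.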


Now we sum (11.7) over all possible sequences $z$. In doing
so, we recall that $w$ is a sequence of the type $x_{-m}, \ldots, x_{-1},
x_0, \ldots, x_{j-1}$, where $x_{-m}, \ldots, x_{-1}$ is now the variable sequence $z$,
and $x_0, \ldots, x_{j-1}$ is the fixed sequence $u$. Therefore $\sum_z w = u$
and $\sum_z \omega(w, v) = \omega(u, v) = \omega(Z)$. Since $\mu(w)\,\nu_w(v) = \omega(w, v)$ by the
very definition of the distribution $\omega$, the right side of (11.7)
becomes

    $st\,\omega(u, v) + o(s) = st\,\omega(Z) + o(s)$

after summation over $z$. The sum of the terms in the left
side of (11.7) can be written

    $\sum_{k=0}^{st-1} g_v(T^k y) \sum_z g_w(T^k x).$

Since $\sum_z w = u$, the inner sum equals $g_u(T^k x)$; therefore,
summing the left side of (11.7) over $z$ gives

    $\sum_{k=0}^{st-1} g_u(T^k x)\, g_v(T^k y),$

and we obtain for $x \in M$, $y \in N_x^*$

(11.8)    $\sum_{k=0}^{st-1} g_Z(T^k x, T^k y) = st\,\omega(Z) + o(s).$


If $Q$ is the set of pairs $(x, y)$ for which $x \in M$, $y \in N_x^*$, then by
definition of the distribution $\omega$, and because $\nu_x(N_x^*) = 1$, we have

    $\omega(Q) = \int_M \nu_x(N_x^*)\, d\mu(x) = \mu(M) = 1.$

Thus (11.8) obtains almost everywhere in $C^I$. This means that
the source $[C, \omega]$ reflects the cylinder $Z$. Since this cylinder
is arbitrary, then by the theorem of §4, we have proved that
the source $[C, \omega]$ is ergodic.

It follows at once that the source $[B, \eta]$ is ergodic. Indeed,
if $N \in F_B$ is an invariant set in $B^I$, then $A^I \times N$ is obviously an
invariant set in $C^I$. Since the source $[C, \omega]$ has already been
proved to be ergodic, then $\eta(N) = \omega(A^I \times N)$ equals 0 or 1; but
this means that the source $[B, \eta]$ is ergodic.
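
Relation (11.8) says that along almost every realization the relative frequency of a cylinder $Z$ of pairs tends to $\omega(Z)$. A small Monte Carlo sketch in Python can illustrate this numerically. It assumes a source of independent, equiprobable binary letters (which is ergodic) and a memoryless channel flipping each letter with probability `flip`; these choices and the name `pair_frequency` are assumptions of the sketch, and a simulation of course proves nothing.

```python
import random

def pair_frequency(u, v, n_steps, flip, rng):
    """Relative frequency of positions k at which the input window equals u
    and, simultaneously, the output window equals v."""
    j = len(u)
    x = [rng.randint(0, 1) for _ in range(n_steps + j)]       # source output
    y = [xi ^ (rng.random() < flip) for xi in x]              # channel output
    hits = sum(1 for k in range(n_steps)
               if tuple(x[k:k + j]) == u and tuple(y[k:k + j]) == v)
    return hits / n_steps

# Cylinder Z: (x_0, x_1) = (1, 0) and (y_0, y_1) = (1, 0).
# Under the stated assumptions omega(Z) = 0.5 * 0.5 * 0.9 * 0.9 = 0.2025.
freq = pair_frequency((1, 0), (1, 0), 200_000, 0.1, random.Random(0))
```

With a few hundred thousand letters the observed frequency should lie within about one percent of 0.2025.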


CHAPTER IV. 


Feinstein’s Fundamental Lemma 


§12. Formulation of the problem


The authors of most investigations on the foundations of 
information theory agree in considering the culmination of the 
theory of discrete information to be Shannon’s theorems on the 
optimum use that can be made of noisy channels by suitably 
coding the transmitted text. The proofs of these theorems 
are given only sketchily in Shannon’s works, the analysis being 
limited to sources of the Markov chain type. In his work, 
McMillan considerably refined in mathematical respects the 
fundamental concepts of Shannon’s theory. (It is just these 
refined concepts that we used in the preceding sections.) More- 
over, he outlined a method of proving Shannon’s theorems for 
any ergodic source; more exactly, he tried to carry over to 
this case a proof the idea of which had been given by Shannon. 
In this connection, a number of gaps were discovered, which 
evidently can not be filled even in the special case of the 
Markov sources considered by Shannon. 

In 1953, Feinstein [4] proposed a fundamentally new way of 
substantiating Shannon’s theorems, which makes the whole 
theory considerably more transparent. Feinstein’s idea consists 
in deriving from the channel itself as much as can possibly be 
used to prove Shannon’s theorems, before coding and even before 
connecting the channel to any particular source. This is done 
by proving a proposition which it is natural to call the
fundamental lemma of the whole theory, and which is formulated
without any mention of either sources or codes. The proof of
this proposition is our next problem. It should be noted that 
Feinstein carries out the proof only for channels without 
memory (m=0) and only alludes to possible generalizations. It 
is, of course, very important to know to how broad a class of 
channels the fundamental lemma applies. We also note that 
both the statement and the proof of the lemma as given by 
Feinstein contain a number of inaccuracies which, however, can 
be corrected easily. 

Consider a stationary channel $[A, \nu_x, B]$, without anticipation
and with finite memory $m$. If this channel is fed by a stationary
source $[A, \mu]$, we obtain a definite rate of transmission

    $R(X, Y) = H(X) - H_Y(X),$

which (see §10) can be regarded as the amount of information
obtained on the average when one letter is transmitted. Of
course, this rate depends on the source $[A, \mu]$ driving the channel.
However, the least upper bound of $R(X, Y)$ for all possible
ergodic sources $[A, \mu]$ is a quantity which depends only on the
channel $[A, \nu_x, B]$ itself. We shall denote this quantity by $C$,
and we shall call it the (ergodic) capacity of the channel.
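
As a numerical illustration of the least upper bound, the following Python sketch searches over a grid of sources with independent binary letters feeding a memoryless binary channel. Restricting the search to such sources is an assumption of the sketch: in general it yields only a lower bound for $C$, although for a channel without memory it is a standard fact that independent letters already attain the supremum. The names `entropy`, `rate`, and `capacity_over_iid_sources` are hypothetical.

```python
from math import log2

def entropy(law):
    """Entropy in bits of a finite distribution {event: probability}."""
    return -sum(p * log2(p) for p in law.values() if p > 0)

def rate(mu, nu):
    """R(X, Y) = H(X) + H(Y) - H(X, Y) for independent letters, memoryless channel."""
    omega = {(a, b): mu[a] * nu[a][b] for a in mu for b in nu[a]}
    eta = {}
    for (a, b), p in omega.items():
        eta[b] = eta.get(b, 0.0) + p
    return entropy(mu) + entropy(eta) - entropy(omega)

def capacity_over_iid_sources(nu, steps=1000):
    """Largest rate found on a grid of binary input laws (p, 1 - p)."""
    best_rate, best_p = 0.0, None
    for i in range(steps + 1):
        p = i / steps
        r = rate({0: p, 1: 1 - p}, nu)
        if r > best_rate:
            best_rate, best_p = r, p
    return best_rate, best_p

# Illustrative channel: each letter is flipped with probability 0.1.
nu = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
cap, p_star = capacity_over_iid_sources(nu)
```

For this symmetric channel the maximum is found at equiprobable inputs, with a value of about 0.531 bits per letter.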

As before, we denote the elements of the sets $A^I$ and $B^I$ by

    $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$ and $y = (\ldots, y_{-1}, y_0, y_1, \ldots),$

respectively. Let $n$ be any fixed positive integer. Let us
agree to denote by $u$ the sequence (cylinder) $x_{-m}, \ldots, x_{-1},
x_0, \ldots, x_{n-1}$ of length $n + m$, where all the $x_i$ take on definite
values (letters of the alphabet $A$). Obviously, the number of
different sequences $u$ is $a^{m+n}$, where $a$ is the number of
letters in the alphabet $A$. Similarly, let us denote by $v$ the
sequence (cylinder) $y_0, \ldots, y_{n-1}$ of length $n$, where all the $y_i$
are letters of the alphabet $B$. Obviously, the number of
different sequences $v$ is $b^n$, where $b$ is the number of letters in
the alphabet $B$. Since the memory of the channel is $m$,
the probability $\nu_x(v)$ of receiving $y \in v$ at the channel output
when a given $x$ is transmitted is the same for all $x$ belonging
to the same sequence $u$. Thus, this probability depends only
on the sequences $u$ and $v$ selected, and it is natural to denote
it by $\nu_u(v)$. Similarly, if $V$ is the union of several sequences
$v$, we shall denote by $\nu_u(V)$ the probability of the event $y \in V$
when any $x \in u$ is transmitted.


Now, let $\lambda$ be any constant ($0 < \lambda < 1$). Let us agree to call
a group $\{u_i\}$ ($1 \le i \le N$) of $u$-sequences distinguishable if there
exists a group $\{V_i\}$ ($1 \le i \le N$) of sets $V_i \subset B^I$, where each $V_i$
($1 \le i \le N$) is a set of several sequences $v$, such that 1) $V_i$ and
$V_k$ have no sequences in common for $i \ne k$, and 2) $\nu_{u_i}(V_i) > 1 - \lambda$
($1 \le i \le N$). Clearly, this definition of a distinguishable group
depends on the parameter $\lambda$. We can now formulate Feinstein’s
fundamental lemma as follows.

Lemma.

If a given channel is stationary, without anticipation, and
with finite memory $m$, then, for sufficiently small $\lambda > 0$ and
sufficiently large $n$, there exists a distinguishable group $\{u_i\}$
($1 \le i \le N$) of $u$-sequences with

    $N > 2^{n(C - \lambda)}$

members, where $C$ is the (ergodic) capacity of the channel.
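
The two conditions defining a distinguishable group can be checked mechanically for small examples. The Python sketch below does so for a channel without memory ($m = 0$), where $\nu_u(v)$ is a product of per-letter probabilities; this restriction, the names `nu_u` and `is_distinguishable`, and the three-letter repetition example are assumptions of the sketch rather than anything taken from the text.

```python
from itertools import product

def nu_u(u, V, nu):
    """Probability nu_u(V) that the received sequence lies in V when u is sent,
    for a channel without memory (m = 0)."""
    total = 0.0
    for v in V:
        p = 1.0
        for a, b in zip(u, v):
            p *= nu[a][b]
        total += p
    return total

def is_distinguishable(us, Vs, nu, lam):
    """Check 1) the sets V_i are pairwise disjoint and 2) nu_{u_i}(V_i) > 1 - lam."""
    sets = [set(V) for V in Vs]
    disjoint = all(sets[i].isdisjoint(sets[k])
                   for i in range(len(sets)) for k in range(i + 1, len(sets)))
    reliable = all(nu_u(u, V, nu) > 1 - lam for u, V in zip(us, sets))
    return disjoint and reliable

# Illustrative channel flipping each letter with probability 0.05, n = 3.
nu = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.05, 1: 0.95}}
outputs = list(product((0, 1), repeat=3))
us = [(0, 0, 0), (1, 1, 1)]
Vs = [[v for v in outputs if sum(v) <= 1],     # "mostly zeros"
      [v for v in outputs if sum(v) >= 2]]     # "mostly ones"

ok_loose = is_distinguishable(us, Vs, nu, lam=0.01)    # nu_{u_i}(V_i) = 0.99275 > 0.99
ok_tight = is_distinguishable(us, Vs, nu, lam=0.005)   # 0.99275 is not > 0.995
```

The example shows how the same group of sequences may or may not be distinguishable depending on the parameter $\lambda$.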
The importance of a proposition of this kind in evaluating
the optimum transmission of information is clear almost
immediately. If $\lambda$ is small, since $\nu_{u_i}(V_i) > 1 - \lambda$, when a sequence
$u_i$ is transmitted, then at the channel output we obtain with
overwhelming probability a sequence $v$ in the group $V_i$; since
the groups $V_i$ have no $v$ in common, then, knowing $v$, we
uniquely determine the group $V_i$ which contains it, which
means that we guess with an overwhelming probability the
transmitted sequence $u_i$. Of course all this is under the
condition that the group $\{u_i\}$ is distinguishable. Thus, if only
sequences of a certain distinguishable group are used to send
signals, then, despite the noise, the signals sent can be guessed
with overwhelming probability. The question of whether any
text suitable for transmission can be “encoded” into the
sequences $u_i$ depends, of course, on how many such sequences
there are in the first place. The fundamental lemma shows
at once that the number of such sequences is very great for
sufficiently large $n$. After this lemma has been proved, we
shall consider the question of how to use the estimate of $N$
that it affords to compute an index of optimum transmission
of given material through a given channel. In fact, this is
the way we shall approach the basic Shannon theorems.


§13. Proof of the lemma

We shall divide the proof into a number of separate steps
in the interest of a better presentation.

1. The source $[A, \mu]$. By definition of the number $C$, we
can find an ergodic source $[A, \mu]$ for the channel $[A, \nu_x, B]$
such that when this source drives the channel, we have a rate
of transmission (see the end of §10)

    $R(X, Y) = H(X) - H_Y(X) > C - \frac{\lambda}{4}.$

Here $H(X)$ denotes the entropy of the source $[A, \mu]$ and $H_Y(X)$
the average conditional entropy of the same source, given the
signal received at the channel output. The source $[A, \mu]$,
being ergodic, has the E property according to McMillan’s
theorem (see §5). This means the following. Let $w$ denote an
arbitrary sequence $x_0, \ldots, x_{n-1}$ of letters from the alphabet $A$
($w$ is therefore some cylinder in $A^I$). Moreover, let $W_0 \subset A^I$
denote the set (union) of all cylinders $w$ for which
(13.1)    $\left| \frac{\lg \mu(w)}{n} + H(X) \right| \le \frac{\lambda}{4};$

then, for arbitrarily small $\lambda > 0$ we have for sufficiently large $n$

    $\mu(W_0) > 1 - \lambda.$
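
The E property can be observed numerically for the simplest ergodic source. The Python sketch below assumes a source emitting independent binary letters, with `p` the probability of the letter 1, and computes the total probability of the $n$-term sequences satisfying an inequality of the form (13.1) with a tolerance `eps`. The function name and the numerical values are assumptions of the sketch.

```python
from math import comb, log2

def typical_set_probability(p, n, eps):
    """Total probability of the n-term sequences w with
    | lg mu(w) / n + H | <= eps, for independent letters with P(1) = p."""
    H = -(p * log2(p) + (1 - p) * log2(1 - p))        # entropy per letter, in bits
    total = 0.0
    for k in range(n + 1):                            # k = number of ones in w
        log_mu = k * log2(p) + (n - k) * log2(1 - p)  # lg mu(w), same for all such w
        if abs(log_mu / n + H) <= eps:
            total += comb(n, k) * 2.0 ** log_mu       # comb(n, k) sequences share it
    return total

short = typical_set_probability(0.3, 10, 0.1)      # far from 1 for small n
long_ = typical_set_probability(0.3, 1000, 0.1)    # very close to 1 for large n
```

For fixed tolerance the probability of the set approaches 1 as $n$ grows, which is the content of the E property used in this step.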


2. The source $[C, \omega]$. Connecting the source $[A, \mu]$ to the
channel $[A, \nu_x, B]$ gives us, according to the definitions of §10,
the compound source $[C, \omega]$, with the alphabet $C = A \times B$ and
entropy $H(X, Y)$. By the theorems of §11, this source is
ergodic, and therefore, by McMillan’s theorem (§5), has the E
property. This means the following. Each pair $(w, v)$ of $n$-term
sequences [$w \subset A^I$, $v \subset B^I$] is some cylinder in the space $C^I$,
with a definite probability $\omega(w, v)$ in this space. Let $Z$ denote
the set of all cylinders $(w, v)$ for which

(13.2)    $\left| \frac{\lg \omega(w, v)}{n} + H(X, Y) \right| \le \frac{\lambda}{4};$

then, for arbitrarily small $\lambda > 0$ and $\delta > 0$, we have, for
sufficiently large $n$

    $\omega(Z) > 1 - \delta.$
3. The source $[B, \eta]$. This source, which we described in
§10, is determined by the relation

    $\eta(M) = \int_{A^I} \nu_x(M)\, d\mu(x) = \omega(A^I \times M) \qquad (M \in F_B).$

We proved that it is ergodic at the end of §11. According to
McMillan’s theorem it has the E property. This means the
following. The probability of the set of $n$-term sequences
$v \subset B^I$ for which

(13.3)    $\left| \frac{\lg \eta(v)}{n} + H(Y) \right| \le \frac{\lambda}{4}$

is as close to 1 as desired for sufficiently large $n$. ($H(Y)$ denotes
the entropy of the source $[B, \eta]$.)

4. The set $A_w$ and its probability. Let us denote by $X$ the
set of pairs of sequences $(w, v)$ for which the inequalities (13.2)
and (13.3) are satisfied. By the foregoing, we have

    $\omega(X) > 1 - \frac{\lambda^2}{2}$

if $n$ is large enough. Now, for any sequence $w$, let $A_w \subset B^I$
denote the set of all sequences $v$ for which $(w, v) \in X$, i.e., for
which the inequalities (13.2) and (13.3) are satisfied.
Furthermore, let $W_1 \subset A^I$ denote the set of all sequences $w \subset W_0$ for
which

(13.4)    $\frac{\omega(w, A_w)}{\mu(w)} = \frac{\omega(w, A_w)}{\omega(w, B^I)} > 1 - \frac{\lambda}{2}.$

We now estimate the probability of the set $W_1$ by using Lemma
2.1. The roles of the spaces $A$, $B$, $AB$ are played by the sets
of all sequences $w$, $v$, $(w, v)$, respectively. For $Z$ we take the
set $X$, for $U_0$ the set $W_0$. Then clearly $\lambda^2/2$ plays the role of
$\delta_1$, and $\lambda$ that of $\delta_2$. The set of sequences $v$ for which $(w, v) \notin X$,
i.e., $\Gamma_w = B^I - A_w$, takes the place of the set $\Gamma_i$. Finally, the
set of $w \subset W_0$ for which the conditional probability

    $P_w(\Gamma_w) = 1 - P_w(A_w) = 1 - \frac{\omega(w, A_w)}{\mu(w)} < \frac{\lambda}{2}$

plays the role of the set $U_1$. Putting $\alpha = \lambda/2$, we can take our
set $W_1$ as the set $U_1$ of Lemma 2.1. We see at once that all
the premises of Lemma 2.1 have been fulfilled. Using it, we
conclude that

(13.5)    $\mu(W_1) > 1 - \lambda - \lambda = 1 - 2\lambda.$

Now let the sequence $w$ belong to the set $W_1$ and the
sequence $v$ to the set $A_w$, so that by the definition of the set
$A_w$ we have $(w, v) \subset X$. Since $W_1 \subset W_0$, then, by the definition
of the sets $X$ and $W_0$, all three inequalities (13.1), (13.2), (13.3)
are fulfilled for the pairs $(w, v)$. In particular, we have

    $\frac{\lg \mu(w)}{n} + H(X) \le \frac{\lambda}{4}; \qquad \frac{\lg \eta(v)}{n} + H(Y) \le \frac{\lambda}{4};$

    $\frac{\lg \omega(w, v)}{n} + H(X, Y) \ge -\frac{\lambda}{4},$

whence

    $\lg \frac{\omega(w, v)}{\mu(w)\,\eta(v)} + n[H(X, Y) - H(X) - H(Y)] \ge -\frac{3}{4} n \lambda.$

Since (see §10)

    $H(X) + H(Y) - H(X, Y) = R(X, Y),$

we find

    $\lg \frac{\omega(w, v)}{\mu(w)\,\eta(v)} \ge n\left[ R(X, Y) - \frac{3}{4}\lambda \right],$

and therefore

    $\frac{\omega(w, v)}{\mu(w)} \ge 2^{n[R(X, Y) - \frac{3}{4}\lambda]}\, \eta(v).$

Since we have $R(X, Y) > C - \frac{\lambda}{4}$ by the choice of the source
$[A, \mu]$, then

    $\frac{\omega(w, v)}{\mu(w)} > 2^{n(C - \lambda)}\, \eta(v).$

This inequality holds for any $w \subset W_1$, $v \subset A_w$. Keeping the
sequence $w \subset W_1$ fixed, and summing over all $v \subset A_w$, we find

    $\frac{\omega(w, A_w)}{\mu(w)} > 2^{n(C - \lambda)}\, \eta(A_w);$

but the left side of this inequality does not exceed unity, since
$\omega(w, A_w) \le \omega(w, B^I) = \mu(w)$. Therefore we find that for any
sequence $w \subset W_1$

(13.6)    $\eta(A_w) < 2^{-n(C - \lambda)}.$


5. Special groups of $w$-sequences. We agree to call the group
$\{w_i\}$ ($1 \le i \le N$) of $w$-sequences special if a set $B_i$ of $v$-sequences
can be associated with each sequence $w_i$ of this group such
that

1) the sets $B_i$ and $B_k$ have no common sequence for $i \ne k$;
2) $\frac{\omega(w_i, B_i)}{\mu(w_i)} > 1 - \lambda$ ($1 \le i \le N$);
3) $\eta(B_i) < 2^{-n(C - \lambda)}$ ($1 \le i \le N$).

First of all, let us convince ourselves that special groups
exist. To do this, we take any sequence $w \subset W_1$ and put $B = A_w$.
Then by (13.4) and (13.6), we have

    $\frac{\omega(w, B)}{\mu(w)} > 1 - \lambda; \qquad \eta(B) < 2^{-n(C - \lambda)}.$

Therefore, every sequence $w \subset W_1$ by itself forms a special group, which
proves the existence of special groups. Now let us call a
special group of $w$-sequences maximal, if adding any other
sequence to it causes it to lose the character of a special group.*
In view of the foregoing, the existence of a maximal group is
obvious.
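
The idea of building up a group one sequence at a time until nothing more can be added can be imitated on a toy example. The Python sketch below is only loosely modelled on the construction in the text and rests on several assumptions: the channel is binary, without memory, and flips each letter with probability `flip`; a Hamming ball around $w$ stands in for the set $A_w$; condition 3) on $\eta(B_i)$ is not checked; and the candidates are scanned once in lexicographic order. The names `ball`, `prob`, and `greedy_group` are hypothetical.

```python
from itertools import product

def ball(w, radius):
    """All binary sequences within Hamming distance `radius` of w (stand-in for A_w)."""
    return {v for v in product((0, 1), repeat=len(w))
            if sum(a != b for a, b in zip(w, v)) <= radius}

def prob(w, V, flip):
    """Probability that the output lies in V when w is sent (memoryless binary channel)."""
    total = 0.0
    for v in V:
        d = sum(a != b for a, b in zip(w, v))
        total += flip ** d * (1 - flip) ** (len(w) - d)
    return total

def greedy_group(n, radius, flip, lam):
    """Scan all w; accept w when B_w = ball(w) minus the outputs already used
    still has conditional probability greater than 1 - lam."""
    group, used = [], set()
    for w in product((0, 1), repeat=n):
        B = ball(w, radius) - used          # analogue of B_w, disjoint from earlier B_i
        if prob(w, B, flip) > 1 - lam:
            group.append((w, B))
            used |= B
    return group

group = greedy_group(n=5, radius=1, flip=0.01, lam=0.01)
words = [w for w, _ in group]
```

Since the set of used outputs only grows, a rejected sequence would be rejected again on any later pass, so within this simplified rule the resulting group cannot be enlarged; for the parameters shown it contains four sequences.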

6. Estimate of the number of members of a maximal special
group. Let $M = \{w_i\}$ ($1 \le i \le N$) be a maximal special group
of $w$-sequences. Now, for any sequence $w$, let us put

    $A_w - A_w \sum_{i=1}^{N} B_i = B_w.$

Obviously the set $B_w \subset B^I$ has no elements in common with any
of the sets $B_i$ ($1 \le i \le N$). If $w \subset W_1$, then by (13.6)

    $\eta(B_w) \le \eta(A_w) < 2^{-n(C - \lambda)}.$

If the sequence $w \subset W_1$ did not belong to the group $M$ and if
we had at the same time

    $\frac{\omega(w, B_w)}{\mu(w)} > 1 - \lambda,$

then adding this sequence $w$ to the group $M$ would evidently
again give a special group of $w$-sequences, which is impossible,
since the group $M$ is assumed to be maximal. Therefore, if
$w \subset W_1$ and $w \notin M$ (i.e., if $w \subset W_1 - MW_1$), then

(13.7)    $\frac{\omega(w, B_w)}{\mu(w)} \le 1 - \lambda.$

But by the definition of the set $B_w$,

    $\omega(w, B_w) = \omega(w, A_w) - \omega\Bigl(w, A_w \sum_{i=1}^{N} B_i\Bigr),$

and if $w \subset W_1 - MW_1$, it follows from (13.4) and (13.7) that

    $\omega\Bigl(w, A_w \sum_{i=1}^{N} B_i\Bigr) = \omega(w, A_w) - \omega(w, B_w) > \Bigl(1 - \frac{\lambda}{2}\Bigr)\mu(w) - (1 - \lambda)\,\mu(w) = \frac{\lambda}{2}\,\mu(w).$

*If the set of all $w$-sequences is a special group, then we consider this group to be
maximal also.

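From this we find that

(13.8)    $\eta\Bigl(\sum_{i=1}^{N} B_i\Bigr) \ge \sum_{w \subset W_1 - MW_1} \omega\Bigl(w, \sum_{i=1}^{N} B_i\Bigr) + \sum_{w \subset MW_1} \omega\Bigl(w, \sum_{i=1}^{N} B_i\Bigr)$

          $\ge \frac{\lambda}{2} \sum_{w \subset W_1 - MW_1} \mu(w) + \sum_{w \subset MW_1} \omega\Bigl(w, \sum_{i=1}^{N} B_i\Bigr)$

          $\ge \frac{\lambda}{2}\,[\mu(W_1) - \mu(MW_1)] + (1 - \lambda)\,\mu(MW_1),$

where only the estimate of the last sum requires explanation.
The fact is that if $w \subset MW_1 \subset M$, the sequence $w$ is one of the
sequences $w_i$ of the special group $M$. For example, if $w = w_k$,
then

    $\omega\Bigl(w, \sum_{i=1}^{N} B_i\Bigr) \ge \omega(w_k, B_k) > (1 - \lambda)\,\mu(w),$

and therefore

    $\sum_{w \subset MW_1} \omega\Bigl(w, \sum_{i=1}^{N} B_i\Bigr) \ge (1 - \lambda)\,\mu(MW_1).$

Furthermore, using (13.5) we find that for $\lambda < \frac{1}{2}$, (13.8) becomes

(13.9)    $\eta\Bigl(\sum_{i=1}^{N} B_i\Bigr) \ge \frac{\lambda}{2}\,\mu(W_1) + \Bigl(1 - \lambda - \frac{\lambda}{2}\Bigr)\mu(MW_1)
          \ge \frac{\lambda}{2}\,\mu(W_1) > \frac{\lambda}{2}\,(1 - 2\lambda).$

But since by the definition of a special group

    $\eta(B_i) < 2^{-n(C - \lambda)} \qquad (1 \le i \le N),$

then, on the other hand

    $\eta\Bigl(\sum_{i=1}^{N} B_i\Bigr) < N \cdot 2^{-n(C - \lambda)}.$

Comparing this with the inequality (13.9) we find

    $N > \frac{\lambda}{2}\,(1 - 2\lambda)\, 2^{n(C - \lambda)},$

and therefore, for sufficiently large $n$

    $N > 2^{n(C - 2\lambda)}.$

Thus we have established an important lower bound for the
number of terms in any maximal special group of $w$-sequences.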

7. Completion of the proof of the fundamental lemma. Let 
us consider an arbitrary maximal special group {w,} (1212N) 
of w-sequences, and let us extend each sequence w, of this 
group by adding m more letters to the left, obtaining in this 
way a new sequence wu; of length n+m. Let us convince our- 
selves that by selecting this extension in a suitable way, we 


have 


(Us B;) 
BU; 


S14 U.ZtZ NN). 


Every possible extension of the sequence $w_i$ by m letters to
the left gives a cylinder (sequence) $u_i$ contained in the cylinder
$w_i$ $(u_i \subset w_i)$, and the union of all the cylinders
$u_i \subset w_i$ is the cylinder $w_i$. Therefore for $1 \le i \le N$

$$\frac{\omega(w_i, B_i)}{\mu(w_i)} = \sum_{u_i \subset w_i} \frac{\omega(u_i, B_i)}{\mu(w_i)} = \sum_{u_i \subset w_i} \frac{\omega(u_i, B_i)}{\mu(u_i)} \cdot \frac{\mu(u_i)}{\mu(w_i)}.$$

The left side of this equality exceeds $1-\lambda$ by the definition
of a special group. The sum of the second factors in the sum on
the right side is obviously unity. Therefore at least one of
the first factors must exceed $1-\lambda$; but this means that, for
at least one extension, we must have

$$\frac{\omega(u_i, B_i)}{\mu(u_i)} > 1-\lambda, \tag{13.10}$$

as asserted.
By the definition of the distribution $\omega$

$$\omega(u_i, B_i) = \int_{u_i} \nu_x(B_i)\, d\mu(x).$$

But, as we saw in §12, $\nu_x(B_i)$ assumes the same value for all
$x \in u_i$, which we agreed to denote by $\nu_{u_i}(B_i)$. Thus

$$\omega(u_i, B_i) = \mu(u_i)\, \nu_{u_i}(B_i),$$

and (13.10) can be written as

$$\nu_{u_i}(B_i) > 1-\lambda \qquad (1 \le i \le N).$$


Selecting the extension of the sequence $w_i$ in the manner
indicated above, we obtain a group of N sequences $u_i$ of length
$n+m$ (of letters from the alphabet A) and a group of N sets
$B_i$, each of which is the union of several sequences v of length
n (of letters from the alphabet B), where 1) the sets $B_i$ and
$B_k$ have no common elements for $i \ne k$, 2) $\nu_{u_i}(B_i) > 1-\lambda$
$(1 \le i \le N)$, and 3) $N > 2^{n(C-2\lambda)}$. This just means that
the sequences $u_i$ $(1 \le i \le N)$ form a distinguishable group.
But since the number of terms of this group exceeds $2^{n(C-2\lambda)}$
and $\lambda > 0$ can be chosen as small as desired, the proof of the
fundamental lemma is complete.


CHAPTER V. 


Shannon’s Theorems 


§14. Coding 

Up to now, we have always assumed that the source feed- 
ing the given channel has an alphabet which is identical with 
the input alphabet of the channel. However, in practice these 
alphabets are different in the majority of cases, and we must 
now consider the general case. 

Suppose the output of the stationary source $[A_0, \mu]$ is to
be transmitted by means of the stationary channel $[A, \nu_x, B]$,
where in general the alphabets $A_0$ and A are different. Then,
before transmitting, it is necessary to transform ("encode") the
sequence of letters from the alphabet $A_0$ emanating from the
source into some sequence of letters from the alphabet A. As
usual, we suppose that any message emanating from the source
$[A_0, \mu]$ is in the form of a sequence

$$\theta = (\ldots, \theta_{-1}, \theta_0, \theta_1, \theta_2, \ldots), \tag{14.1}$$

where all the $\theta_i$ are letters of the alphabet $A_0$. We uniquely
transform ("encode") each $\theta$ into some sequence

$$x = (\ldots, x_{-1}, x_0, x_1, x_2, \ldots), \tag{14.2}$$

where all the $x_i$ are letters of the alphabet A. The rules of this
transformation constitute the "code" used. Thus, from the
mathematical point of view, a code is simply any function

$$x = x(\theta),$$


where $\theta \in A_0^I$, $x \in A^I$. It is clear that formally every
code can be regarded as a kind of channel with input alphabet $A_0$
and output alphabet A. Such channels are characterized as being
noiseless, i.e., to a sequence $\theta$ at the channel input corresponds
a unique sequence $x = x(\theta)$ at the output. We can give this
channel the usual designation $[A_0, \rho_\theta, A]$, where the form of
the function $\rho_\theta(M)$ $(M \subset A^I)$ is also apparent, namely

$$\rho_\theta(M) = \begin{cases} 1 & [x(\theta) \in M], \\ 0 & [x(\theta) \notin M]. \end{cases}$$


However, not every code has practical value. If we must
actually know (as will be necessary in the general case) the
whole sequence $\theta$, i.e., the whole infinite message from the
source $[A_0, \mu]$, in order to determine any letter $x_k$ of the
encoded text, then obviously, in practice, we can never determine
this letter. Consequently, the only codes of importance
in applications are those such that it suffices to know some
finite sequence of letters $\theta_i \in A_0$ in order to uniquely
determine each letter $x_k \in A$ (and also each finite sequence of
such letters). In particular, in information theory "sequential
coding" is used predominantly, which consists of the following:
Both the sequences (14.1) and (14.2) are divided into finite
subsequences of any length which are numbered from left to right,
just like the letters of these sequences. A rule is given which
uniquely determines the k'th subsequence of the sequence x in
terms of the k'th subsequence of the sequence $\theta$.
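The sequential coding just described can be sketched in a few
lines. This is only an illustration under stated assumptions: the
function name `block_encode`, the toy codebook, and the block
lengths are hypothetical and do not come from the text, and a
finite message stands in for the doubly infinite sequence (14.1).

```python
from typing import Dict, List, Sequence, Tuple

def block_encode(theta: Sequence[str],
                 codebook: Dict[Tuple[str, ...], Tuple[str, ...]],
                 n: int) -> List[str]:
    """Encode theta block by block.

    The k'th block of n source letters alone determines the k'th
    block of channel letters, which is the defining feature of
    sequential coding.
    """
    if len(theta) % n != 0:
        raise ValueError("message length must be a multiple of n")
    x: List[str] = []
    for start in range(0, len(theta), n):
        block = tuple(theta[start:start + n])
        x.extend(codebook[block])  # look up the image of this block
    return x

# Hypothetical example: source alphabet {a, b}, channel alphabet {0, 1},
# each single source letter (n = 1) is mapped to three channel letters.
codebook = {("a",): ("0", "0", "0"), ("b",): ("1", "1", "1")}
encoded = block_encode(list("abba"), codebook, n=1)
```

Because each output block depends on one input block only, any
letter of the encoded text is known as soon as finitely many source
letters have been read.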

Now let us turn to the problem of transmitting the output
of the source $[A_0, \mu]$ by means of the channel $[A, \nu_x, B]$. We
code each message $\theta$ of the source $[A_0, \mu]$ into some specific
message x, composed of letters from the alphabet A. We then
pass this x through the channel $[A, \nu_x, B]$ and obtain some
message $y \in B^I$ at its output. Obviously, the result of connecting
the code selected to the channel $[A, \nu_x, B]$ can be regarded as
a new channel $[A_0, \lambda_\theta, B]$. It is very easy to determine the
function (probability) $\lambda_\theta(Q)$ $(Q \in F_B)$. Since the coding
transforms the message $\theta \in A_0^I$ into the message
$x(\theta) \in A^I$, the probability $\lambda_\theta(Q)$ of obtaining $y \in Q$
for a given $\theta$ is the probability of obtaining $y \in Q$ if the
message $x = x(\theta)$ enters the input of the channel $[A, \nu_x, B]$,
i.e.

$$\lambda_\theta(Q) = \nu_{x(\theta)}(Q),$$

and the channel $[A_0, \lambda_\theta, B]$ can be written as
$[A_0, \nu_{x(\theta)}, B]$.

The process which we are examining consists in connecting the
source $[A_0, \mu]$ to this new channel $[A_0, \lambda_\theta, B]$; this time
the source alphabet coincides with the channel input alphabet, so
that we have a situation with which we are familiar. In
particular, connecting the source to the channel gives, as we
know, the "compound" source $[C, \omega]$, where $C = A_0 \times B$, and
where the probability distribution is such that for
$M \in F_{A_0}$, $N \in F_B$ we have

$$\omega(M \times N) = \int_M \lambda_\theta(N)\, d\mu(\theta) = \int_M \nu_{x(\theta)}(N)\, d\mu(\theta).$$


§15. The first Shannon theorem 

Suppose we have an ergodic source $[A_0, \mu]$ and a stationary,
non-anticipating channel $[A, \nu_x, B]$ with finite memory m. Let
the entropy of the source $[A_0, \mu]$ be $H_0$, and let the ergodic
capacity of the channel $[A, \nu_x, B]$ be C. We assume that $H_0 < C$,
and choose a positive number $\lambda < \tfrac{1}{2}(C - H_0)$. Then we
select a particular code which transforms the output of the source
$[A_0, \mu]$ into the alphabet A of the given channel.

First of all, we note that the source $[A_0, \mu]$, being ergodic,
has the E property (see §5). In particular, this means that
for sufficiently large n, the sequences $\alpha$ of n letters from the
alphabet $A_0$ can be divided into two groups, a "high probability"
group, for each sequence $\alpha$ of which

$$\frac{\lg \mu(\alpha)}{n} + H_0 > -\lambda,$$

or, equivalently,

$$\mu(\alpha) > 2^{-n(H_0 + \lambda)},$$

and a "low probability" group, the total probability of which
is as small as desired. Obviously the number of sequences in
the high probability group is less than
$2^{n(H_0+\lambda)} < 2^{n(C-\lambda)}$. In what follows we shall denote
these sequences by $\alpha_1, \alpha_2, \ldots$. We shall denote the set of
all sequences in the low probability group by $\alpha_0$.
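The counting argument above can be checked numerically on a small
case. The sketch below assumes a memoryless binary source (a
special case of an ergodic source, chosen only for illustration);
the function name, the letter probabilities, n and $\lambda$ are
hypothetical choices, not values from the text.

```python
import math
from itertools import product

def entropy(probabilities):
    """Entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def high_probability_group(letter_probs, n, lam):
    """Return H0 and the n-term sequences with mu(alpha) > 2^(-n(H0+lam)).

    Assumes independent letters, so mu(alpha) is a product.
    """
    h0 = entropy(letter_probs.values())
    threshold = 2.0 ** (-n * (h0 + lam))
    group = []
    for seq in product(letter_probs, repeat=n):
        mu = math.prod(letter_probs[s] for s in seq)
        if mu > threshold:
            group.append((seq, mu))
    return h0, group

letter_probs = {"a": 0.9, "b": 0.1}   # hypothetical source
n, lam = 12, 0.1
h0, group = high_probability_group(letter_probs, n, lam)
count = len(group)                     # size of the high probability group
mass = sum(mu for _, mu in group)      # its total probability
limit = 2.0 ** (n * (h0 + lam))        # the bound 2^(n(H0 + lam))
```

Since every member of the group has probability above the
threshold and the probabilities sum to at most one, `count` can
never reach `limit`. For such a small n the remaining mass
`1 - mass` is still sizable; the E property only promises that it
becomes small as n grows.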

Now we turn to the channel $[A, \nu_x, B]$. By the fundamental
lemma of Chapter IV, for sufficiently large n there exists a
distinguishable group $\{u_i\}$ $(1 \le i \le N)$ of sequences u of
length $n+m$ (consisting of letters from the alphabet A), with
$N > 2^{n(C-\lambda)}$ members. Thus, since N is larger than the number
of sequences $\alpha_i$ in the high probability group just considered,
we can associate with each sequence $\alpha_i$ a sequence $u_i$ such that
different $u_i$ correspond to different $\alpha_i$. In doing this, at
least one sequence of the distinguishable group has not been used;
we associate this sequence with all the sequences of the low
probability group $\alpha_0$. After doing this, to each n-term
sequence $\alpha$ of letters from the alphabet $A_0$ corresponds a
specific sequence u of length $n+m$ from the alphabet A, a sequence
which belongs to the distinguishable group $\{u_i\}$.

We now divide the sequence $\theta$ of letters from the alphabet
$A_0$, which is to be coded, into subsequences of length n, and
the sequence x, into which $\theta$ is to be encoded, into subsequences
of length $n+m$. We number both sets of subsequences from
left to right, as usual. The k'th subsequence in the message
$\theta$ will be one of the sequences $\alpha$; then we select as the k'th
subsequence in the message x the sequence $u_i$ of our
distinguishable group which corresponds to this sequence $\alpha$.
Doing this for all k $(-\infty < k < +\infty)$, it is clear that we
uniquely determine $x = x(\theta)$ in terms of the given $\theta$. In this
way we have set up a specific code, which we shall retain in all
that follows. Obviously, this code is an example of the
"sequential coding" discussed in §14.

Connecting the code just selected to the channel $[A, \nu_x, B]$
we obtain, as we saw in §14, a new channel $[A_0, \lambda_\theta, B]$, where
we have $\lambda_\theta(Q) = \nu_{x(\theta)}(Q)$ for $Q \in F_B$. Making the
source $[A_0, \mu]$ feed this new channel, we arrive, according to
§14, at the "compound" source $[C, \omega]$, where $C = A_0 \times B$ and
$\omega(M \times N)$ (where $M \in F_{A_0}$, $N \in F_B$) is given by the
last formula of §14. Let us agree to denote the different sequences
of length n, consisting of letters from the alphabet B, by
$\beta_k$ $(k = 1, 2, \ldots)$. Then, by the definition of the
distribution $\omega$, we have for any $k \ge 1$ and $i = 0, 1, \ldots, N$

$$\omega(\alpha_i \times \beta_k) = \int_{\alpha_i} \lambda_\theta(\beta_k)\, d\mu(\theta) = \int_{\alpha_i} \nu_{x(\theta)}(\beta_k)\, d\mu(\theta).$$

But $\theta \in \alpha_i$ is equivalent to $x(\theta) \in u_i$*, and since
$\nu_x(\beta_k)$ assumes the same value for all $x \in u_i$, which we
agreed in §12 to designate by $\nu_{u_i}(\beta_k)$, then we have
$\nu_{x(\theta)}(\beta_k) = \nu_{u_i}(\beta_k)$ for all $\theta \in \alpha_i$.
Thus we find

$$\omega(\alpha_i \times \beta_k) = \mu(\alpha_i)\, \nu_{u_i}(\beta_k). \tag{15.1}$$

* Here $u_0$ is understood to be the sequence of the distinguishable
group $\{u_i\}$ into which the whole "low probability group" $\alpha_0$
is encoded.


In order to avoid any misunderstanding, we recall the meaning
of the event $\alpha_i \times \beta_k$, the probability of which is given
by (15.1): $\alpha_i \times \beta_k$ denotes the joint event that 1) the
n-term sequence transmitted by the source $[A_0, \mu]$ coincides with
$\alpha_i$ (if $i > 0$) or belongs to the low probability group (if
$i = 0$), and 2) after being transmitted through the channel
$[A_0, \lambda_\theta, B]$, the n-term sequence transmitted by the source
$[A_0, \mu]$ gives an (n+m)-term sequence of letters from the
alphabet B, the last n letters of which are the sequence $\beta_k$.

Now let $i_k$ be the value of the subscript i $(0 \le i \le N)$ for
which the probability $\omega(\alpha_i \beta_k)$* has its greatest value.
(If there are several such values i, then we take any of them as
$i_k$.) Since the conditional probability of the sequence $\alpha_i$
for a given sequence $\beta_k$ equals

$$\frac{\omega(\alpha_i, \beta_k)}{\omega(A_0^I, \beta_k)},$$

where the denominator is independent of i, then we can also
say that $\alpha_{i_k}$ is the sequence $\alpha_i$ which is most probable for
a given sequence $\beta_k$. Let us set

$$\sum_k \sum_{i \ne i_k} \omega(\alpha_i, \beta_k) = P.$$

Obviously, we can regard P as the probability that the sequence
$\alpha_i$ at the input of the channel $[A_0, \lambda_\theta, B]$ will not be
the most probable one for a given sequence $\beta_k$ at the output of
the channel.

* $\omega(\alpha_i \beta_k)$ and $\omega(\alpha_i \times \beta_k)$ have the same meaning.
The sequences $u_i$, into which we encoded all the sequences
$\alpha_i$, form a distinguishable group; this implies the existence of
a group $\{B_i\}$ $(0 \le i \le N)$ (where every $B_i$ is the union of
several sequences $\beta_k$) such that 1) $\nu_{u_i}(B_i) > 1-\lambda$, and
2) $B_i$ and $B_j$ have no sequences in common for $i \ne j$. But
summing (15.1) over all $\beta_k \subset B_i$, we find

$$\omega(\alpha_i, B_i) = \mu(\alpha_i)\, \nu_{u_i}(B_i) > (1-\lambda)\, \mu(\alpha_i) \qquad (0 \le i \le N).$$

Therefore the conditional probability of obtaining a sequence
$\beta_k \subset B_i$ at the output of the channel $[A_0, \lambda_\theta, B]$,
given that we had a sequence $\alpha_i$ at the input of the channel, is

$$P_{\alpha_i}(B_i) = \frac{\omega(\alpha_i, B_i)}{\mu(\alpha_i)} > 1-\lambda \qquad (0 \le i \le N).$$

Thus we see that the two complete systems of events $\alpha_i$
$(0 \le i \le N)$ and $\beta_k$ $(k = 1, 2, \ldots)$ satisfy all the
premises of Lemma 2.2. Using this lemma, we find

$$P \le \lambda.$$


This important result is the first Shannon theorem, which we
can state as follows.

Theorem.

Let there be given 1) a stationary, non-anticipating channel
$[A, \nu_x, B]$ with ergodic capacity C and finite memory m, and
2) an ergodic source $[A_0, \mu]$ with entropy $H_0 < C$. Let
$\varepsilon > 0$. Then, for sufficiently large n, the output of the
source $[A_0, \mu]$ can be encoded into the alphabet A in such a way
that each sequence $\alpha_i$ of n letters from the alphabet $A_0$ is
mapped into a sequence $u_i$ of $n+m$ letters from the alphabet A,
and such that if the sequence $u_i$ is transmitted through the given
channel, we can determine the transmitted sequence $\alpha_i$ with a
probability greater than $1-\varepsilon$ from the sequence received at
the channel output.

The definition of the sequence $\alpha_{i_k}$ entails that we select
as $\alpha_i$ the sequence which is most probable, given the n letters
last received at the channel output. The first Shannon theorem is
often formulated less exactly by saying that for $H_0 < C$ it is
always possible to find a code such that the transmitted sequence
can be guessed from the received sequence with a probable error
which does not exceed an arbitrarily small number $\varepsilon$.
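The decoding rule behind the theorem (choose the $\alpha_i$ that is
most probable for the received $\beta_k$) and the quantity P can be
computed exactly in a toy case. The sketch below is an assumption-
laden illustration only: it uses a memoryless binary symmetric
channel (memory m = 0), two equally likely source blocks, and a
three-letter repetition code, none of which appear in the text.

```python
from itertools import product

def bsc_likelihood(u, beta, p):
    """nu_u(beta) for a memoryless binary symmetric channel.

    p is the probability that a single letter is flipped.
    """
    flips = sum(a != b for a, b in zip(u, beta))
    return (p ** flips) * ((1 - p) ** (len(u) - flips))

def error_probability(mu, codewords, p):
    """P = sum over beta_k of omega(alpha_i, beta_k) for all i != i_k.

    Uses omega(alpha_i, beta_k) = mu(alpha_i) * nu_{u_i}(beta_k),
    which is formula (15.1), and picks i_k as the index of the
    largest joint probability for each beta_k.
    """
    length = len(codewords[0])
    correct = 0.0
    for beta in product((0, 1), repeat=length):
        joint = [mu[i] * bsc_likelihood(codewords[i], beta, p)
                 for i in range(len(mu))]
        correct += max(joint)      # mass of the most probable alpha_i
    return 1.0 - correct

mu = [0.5, 0.5]                        # hypothetical source blocks
codewords = [(0, 0, 0), (1, 1, 1)]     # hypothetical u_1, u_2
P = error_probability(mu, codewords, p=0.1)
```

For this toy code the sets $B_1$ and $B_2$ can be taken as the
received triples with at most one and at least two ones,
respectively, and P equals the probability of two or more flips.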


§16. The second Shannon theorem 

We retain, in toto, the source $[A_0, \mu]$, the channel
$[A, \nu_x, B]$ and the code of the preceding section, and also all
the notation introduced there. However, we now consider a new
problem pertinent to this whole process, namely, the evaluation of
its transmission rate, i.e., the amount of information given on the
average by one letter at the channel output. To this end, let
us examine first the finite probability space $\alpha_i \beta_k$
$(0 \le i \le N,\ k = 1, 2, \ldots)$, with the distribution
$\omega(\alpha_i, \beta_k)$, which was considered in §15. We recall that
$\alpha_i$ was one of the n-term sequences of the high probability
group in the output of the given source for $i \ge 1$, that
$\alpha_0$ was the whole low probability group of such sequences, and
that $\beta_k$ was an n-term sequence of letters from the alphabet B
(at the channel output). We denoted by P the quantity

$$P = \sum_k \sum_{i \ne i_k} \omega(\alpha_i, \beta_k),$$

where $i_k$ is the value of the subscript i for which the
probability $\omega(\alpha_i, \beta_k)$ is the greatest, and we proved that
$P \le \lambda$. It is clear that we find ourselves in a position to
apply Lemma 2.3; the spaces $\{\alpha_i\}$ and $\{\beta_k\}$ play the role
of the spaces $\{A_i\}$ and $\{B_k\}$ of this lemma, and the number
$N+1$ plays the role of the number n. If, as usual, we denote by
$H_\beta(\alpha)$ the conditional entropy of the space $\{\alpha_i\}$ for a
given $\beta_k$, averaged over all $\beta_k$, then by Lemma 2.3 we have

$$H_\beta(\alpha) \le P \lg N - P \lg P - (1-P) \lg (1-P).$$


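Here the number N is the number of sequences in the high
probability group; as we saw in §15, it does not exceed
$2^{n(C-\lambda)} < 2^{nC}$, whence $\lg N < nC$. Taking into account that
$P \le \lambda$ and that, as is easily seen,
$-P \lg P - (1-P) \lg (1-P) \le 1$ for any P $(0 < P < 1)$, we obtain

$$H_\beta(\alpha) < \lambda n C + 1.$$

In particular, since $\lambda$ can be chosen arbitrarily small for
sufficiently large n, we obtain

$$H_\beta(\alpha) = o(n) \qquad (n \to \infty).$$

Here

$$H_\beta(\alpha) = -\sum_k \eta(\beta_k) \sum_{i=0}^{N} f[p_{\beta_k}(\alpha_i)],$$

where

$$f(x) = x \lg x; \qquad \eta(\beta_k) = \omega(A_0^I, \beta_k) \quad \text{and} \quad p_{\beta_k}(\alpha_i) = \frac{\omega(\alpha_i, \beta_k)}{\omega(A_0^I, \beta_k)} = \frac{\omega(\alpha_i, \beta_k)}{\eta(\beta_k)}.$$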
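The inequality taken from Lemma 2.3 has the same shape as the bound
commonly known as Fano's inequality, and it can be checked on a
small joint distribution. The sketch below is an illustration
under assumptions: the 3-by-3 table `joint` is made up, with
`joint[i][k]` standing for $\omega(\alpha_i, \beta_k)$, so that the
number of rows plays the role of $N+1$.

```python
import math

def binary_entropy(p):
    """-p lg p - (1-p) lg (1-p), with the convention 0 lg 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def conditional_entropy(joint):
    """H_beta(alpha) = -sum_k eta(beta_k) sum_i f[p_{beta_k}(alpha_i)]."""
    total = 0.0
    for k in range(len(joint[0])):
        eta = sum(row[k] for row in joint)       # eta(beta_k)
        for row in joint:
            if row[k] > 0:
                total -= row[k] * math.log2(row[k] / eta)
    return total

def error_and_bound(joint):
    """Return P and the bound P lg N - P lg P - (1-P) lg (1-P)."""
    rows = len(joint)                            # this is N + 1
    best = sum(max(row[k] for row in joint) for k in range(len(joint[0])))
    p_err = 1.0 - best
    extra = p_err * math.log2(rows - 1) if rows > 1 else 0.0
    return p_err, extra + binary_entropy(p_err)

# Hypothetical joint distribution; entries sum to one.
joint = [[0.30, 0.02, 0.01],
         [0.03, 0.28, 0.02],
         [0.02, 0.02, 0.30]]
h_cond = conditional_entropy(joint)
p_err, bound = error_and_bound(joint)
```

Whatever table is supplied, `h_cond` should not exceed `bound`;
the argument in the text then only needs $P \le \lambda$ and
$\lg N < nC$ to conclude $H_\beta(\alpha) < \lambda nC + 1$.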


In all the foregoing $\alpha_0$, unlike $\alpha_i$ for $i > 0$, denotes not
an individual sequence, but the whole "low probability" group of
n-term sequences made up of letters from the alphabet $A_0$.
Now we divide this group into separate sequences

$$\alpha'_1, \alpha'_2, \ldots, \alpha'_q$$

and instead of the space $(\alpha_i)$ $(0 \le i \le N)$ we consider the
space $(\alpha_i, \alpha'_j)$ $(1 \le i \le N,\ 1 \le j \le q)$ of all n-term
sequences consisting of letters from the alphabet $A_0$, with
appropriate probabilities $\mu(\alpha_i)$, $\mu(\alpha'_j)$. We leave the
space $(\beta_k)$ as before. Denoting by $H_\beta(\alpha, \alpha')$ the
conditional entropy of the space $(\alpha_i, \alpha'_j)$ for a given
$\beta_k$, averaged over $\beta_k$, we find

$$H_\beta(\alpha, \alpha') = -\sum_k \eta(\beta_k) \Big\{ \sum_{i=1}^{N} f[p_{\beta_k}(\alpha_i)] + \sum_{j=1}^{q} f[p_{\beta_k}(\alpha'_j)] \Big\} = H_\beta(\alpha) + \sum_k \eta(\beta_k) f[p_{\beta_k}(\alpha_0)] + R, \tag{16.1}$$

where

$$R = -\sum_k \eta(\beta_k) \sum_{j=1}^{q} f[p_{\beta_k}(\alpha'_j)]. \tag{16.2}$$
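(In the interest of complete clarity, we note that here

$$p_{\beta_k}(\alpha'_j) = \frac{\omega(\alpha'_j, \beta_k)}{\eta(\beta_k)},$$

where

$$\omega(\alpha'_j, \beta_k) = \int_{\alpha'_j} \lambda_\theta(\beta_k)\, d\mu(\theta).$$

Since $\lambda_\theta(\beta_k)$ has the same value $\lambda_{\alpha_0}(\beta_k)$ for all
$\theta \in \alpha'_j \subset \alpha_0$, then

$$p_{\beta_k}(\alpha'_j) = \frac{\lambda_{\alpha_0}(\beta_k)\, \mu(\alpha'_j)}{\eta(\beta_k)}$$

is uniquely given by the data of our problem.)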


Since the second term in the right side of (16.1) is negative,
then

$$H_\beta(\alpha, \alpha') < H_\beta(\alpha) + R, \tag{16.3}$$


and since we have already estimated $H_\beta(\alpha)$, it remains only to
estimate the quantity R defined by the sum (16.2). Using Lemma
1.1, we find

$$R \le -\sum_{j=1}^{q} \mu(\alpha'_j) \lg \mu(\alpha'_j). \tag{16.4}$$

Here, by definition of the low-probability group, we have
$\sum_{j=1}^{q} \mu(\alpha'_j) = \mu(\alpha_0) < \lambda$. But it is easily seen
that the greatest value of the sum (16.4), under the supplementary
condition

$$\sum_{j=1}^{q} \mu(\alpha'_j) = \varepsilon,$$

is attained for $\mu(\alpha'_j) = \varepsilon / q$ $(1 \le j \le q)$ and is
$\varepsilon \left[ \lg q + \lg \frac{1}{\varepsilon} \right]$. Therefore, in our
case

$$R < \lambda \left[ \lg q + \lg \frac{1}{\lambda} \right].$$


But q is the number of sequences in the low probability group
and is less than $a^n$, the number of all n-term sequences. (a is
the number of letters in the alphabet $A_0$.) Therefore

$$R < \lambda n \lg a + \lambda \lg \frac{1}{\lambda} < \lambda n \lg a + 1.$$

We see that $R = o(n)$ as $n \to \infty$, and since
$H_\beta(\alpha) = o(n)$, as we proved earlier, it follows from (16.3)
that

$$H_\beta(\alpha, \alpha') = o(n) \qquad (n \to \infty). \tag{16.5}$$
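The extremal property used for (16.4), namely that a sum
$-\sum_j p_j \lg p_j$ with fixed total mass $\varepsilon$ is largest when
all q terms are equal, can be checked numerically. The sketch
below is illustrative only; the masses in `skewed_masses` are
arbitrary assumed values.

```python
import math

def partial_entropy(masses):
    """-sum_j p_j lg p_j for masses that need not sum to one."""
    return -sum(p * math.log2(p) for p in masses if p > 0)

def uniform_value(eps, q):
    """eps * (lg q + lg(1/eps)), the value claimed for equal masses."""
    return eps * (math.log2(q) + math.log2(1.0 / eps))

skewed_masses = [0.01, 0.02, 0.03, 0.04]     # hypothetical mu(alpha'_j)
eps = sum(skewed_masses)                     # total mass of the group
q = len(skewed_masses)

skewed = partial_entropy(skewed_masses)
uniform = partial_entropy([eps / q] * q)     # all masses equal to eps/q
claimed = uniform_value(eps, q)
```

Here `uniform` agrees with `claimed` up to rounding, and `skewed`
stays below it, which is exactly what the estimate of R requires.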


We now change our notation somewhat. By the space
$\alpha(\alpha_1, \alpha_2, \ldots)$ we shall mean the set of all n-term
sequences of letters from the alphabet $A_0$, whether from the high
probability or low probability groups, so that the number of all
$\alpha_i$ is $a^n$, where a is the number of letters in the alphabet
$A_0$. As before, the space $\beta(\beta_1, \beta_2, \ldots)$ denotes the
set of all n-term sequences of letters from the alphabet B. The
product $(\alpha, \beta)$ of these two spaces has the distribution

$$\omega(\alpha_i, \beta_k) = \int_{\alpha_i} \lambda_\theta(\beta_k)\, d\mu(\theta) = \lambda_{\alpha_i}(\beta_k)\, \mu(\alpha_i),$$

and the distributions of the spaces $\alpha$ and $\beta$ are given by the
functions $\mu(\alpha_i)$ and

$$\eta(\beta_k) = \omega(A_0^I, \beta_k) = \sum_i \mu(\alpha_i)\, \lambda_{\alpha_i}(\beta_k)$$

respectively. The conditional entropy of the space $\alpha$ for a given
$\beta_k$, averaged over $\beta_k$, which we denote by $H_\beta(\alpha)$, is
obviously the quantity $H_\beta(\alpha, \alpha')$ which we have just
estimated. Therefore, by (16.5) we have

$$H_\beta(\alpha) = o(n) \qquad (n \to \infty). \tag{16.6}$$


The information transmission process which we are considering
is the following. The output of the source $[A_0, \mu]$ is "cut
up" into sequences of length n, and each such sequence $\alpha_i$ is
transmitted through the channel $[A_0, \lambda_\theta, B]$, giving at the
output a sequence of $n+m$ letters from the alphabet B. The last
n letters of this sequence form the "received" sequence $\beta_k$.
The output of our transmission process consists of a sequence
of such sequences. Our problem is to estimate the rate of
transmission. To do this, we consider a sequence of length
$s = nt + r$ from the output of the source $[A_0, \mu]$, where t is any
positive integer and $0 \le r < n$. We denote such a sequence by
X and the set (space) of all such sequences (for a given s) by
$\{X\}$. Let the s-term sequence Y (of letters from the alphabet
B) be received at the channel output when the sequence X is
transmitted through the channel $[A_0, \lambda_\theta, B]$, and let $\{Y\}$
denote the set (space) of all such sequences Y. We denote by
$H_Y(X)$ the conditional entropy of the space $\{X\}$, averaged over Y.

Each sequence X of length $s = nt + r$ can be decomposed into
t consecutive sequences $\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(t)}$ of
length n and a "residual" sequence $\alpha^*$ of length $r < n$. Clearly
we can regard the space $\{X\}$ as the product of the $t+1$ spaces
$\{\alpha^{(j)}\}$ $(1 \le j \le t)$ and $\{\alpha^*\}$, where each of the first
t spaces has the structure of the space which we considered above.
It is obvious that in general the $t+1$ spaces will be mutually
dependent. Because of the basic property of the entropy of a
product space, we have

$$H_{Y_0}(X) \le \sum_{j=1}^{t} H_{Y_0}[\alpha^{(j)}] + H_{Y_0}(\alpha^*),$$

where $Y_0$ is any fixed sequence of $\{Y\}$. Thus, averaging over
$Y_0$, we find

$$H_Y(X) \le \sum_{j=1}^{t} H_Y[\alpha^{(j)}] + H_Y(\alpha^*). \tag{16.7}$$


Just as we decomposed the sequence X into t sequences
$\alpha^{(j)}$ and the residual sequence $\alpha^*$, we can decompose the
sequence Y into t sequences $\beta^{(j)}$ $(1 \le j \le t)$ of length n
and a residual sequence $\beta^*$ of length $r < n$. Then the space
$\{Y\}$ will be the product of the $t+1$ spaces $\{\beta^{(j)}\}$
$(1 \le j \le t)$ and $\{\beta^*\}$. The sequence $\beta^{(j)}$
$(1 \le j \le t)$ is the sequence at the channel output corresponding
to the sequence $\alpha^{(j)}$ at its input. Therefore the product space
$\{\alpha^{(j)}, \beta^{(j)}\}$ $(1 \le j \le t)$ has the distribution
$\omega(\alpha_i, \beta_k)$ considered above, and the space $\{\beta^{(j)}\}$ has
the distribution $\eta(\beta_k)$. Let $B^{(j)}$ $(1 \le j \le t)$ denote the
set of all the sequences $\beta^{(i)}$ $(1 \le i \le t)$ and $\beta^*$ which
make up Y, with the exception of $\beta^{(j)}$. Then in order to fix a
specific sequence Y, we must fix $\beta^{(j)}$ and $B^{(j)}$. In other
words, we can regard the space $\{Y\}$ as the product of the spaces
$\{\beta^{(j)}\}$ and $\{B^{(j)}\}$. (Here j is any of the numbers
$1, 2, \ldots, t$.) Therefore, by Lemma 1.2, we find for any j
$(1 \le j \le t)$

$$H_Y[\alpha^{(j)}] = H_{\beta^{(j)} B^{(j)}}[\alpha^{(j)}] \le H_{\beta^{(j)}}[\alpha^{(j)}] = H_\beta(\alpha). \tag{16.8}$$


On the other hand, the space $\{\alpha^*\}$ obviously contains $a^r$
events, where a is the number of letters in the alphabet $A_0$. But
by one of the fundamental properties of the entropy, the entropy
of a finite space does not exceed the logarithm of the number
of events in the space (see [1], [5]). Therefore, for any
choice of the sequence $Y_0$,

$$H_{Y_0}(\alpha^*) \le r \lg a < n \lg a,$$

and, consequently, the averaged conditional entropy $H_Y(\alpha^*)$
satisfies

$$H_Y(\alpha^*) < n \lg a. \tag{16.9}$$

Then by (16.8) and (16.9), the inequality (16.7) gives

$$H_Y(X) < t\, H_\beta(\alpha) + n \lg a,$$

from which it follows by (16.6) that for arbitrarily small
$\lambda > 0$, for sufficiently large n, and for any $t \ge 1$

$$H_Y(X) < \lambda t n + n \lg a \le \lambda s + n \lg a.$$


Now we recall that in §10 we characterized the quantity
$H_Y(X)$, the "residual entropy" of the sequence X after it has
been transmitted through the channel, as the amount of information
remaining in the sequence after transmission, i.e.,
lost during transmission. Since the amount of information
contained in the sequence X prior to transmission is $sH_0$, the
amount of information transmitted is $sH_0 - H_Y(X)$. But
transmitting each sequence $\alpha^{(j)}$ (of length n) requires the
passage of $n+m$ symbols through the channel, and just as many
symbols are necessary for transmitting the sequence $\alpha^*$.
Therefore the total number of symbols passing through the channel
during the transmission of the sequence X is $(t+1)(n+m)$. Thus,
one symbol at the channel output carries on the average an
amount of information

$$\frac{sH_0 - H_Y(X)}{(t+1)(n+m)} > \frac{sH_0 - \lambda s - n \lg a}{n(t+1)\left(1 + \frac{m}{n}\right)} \ge \frac{sH_0 - \lambda s - n \lg a}{(s+n)\left(1 + \frac{m}{n}\right)} = \frac{H_0 - \lambda - \frac{n \lg a}{s}}{\left(1 + \frac{n}{s}\right)\left(1 + \frac{m}{n}\right)}.$$

If we select n large enough so that $\frac{m}{n} < \varepsilon$ and then t
large enough so that $\frac{n}{s} \le \frac{1}{t} < \varepsilon$, then the right
side will be larger than

$$\frac{H_0 - \lambda - \varepsilon \lg a}{(1 + \varepsilon)^2} > H_0 - 2\lambda,$$
if $\varepsilon$ is sufficiently small. This very important result means
that with the coding selected each letter received at the channel
output brings on the average an amount of information as close
as desired to that which one letter carries on the average at
the source output. In other words, the transmission of information
occurs at a rate arbitrarily close to that at which the
information is emitted by the driving source. Of course, all
this is only under the condition that $H_0 < C$, and for coding of
sufficiently long sequences. In this regard, we note that if n
must be taken too large, then the practical value of the coding
method described is nullified, since one would have to wait too
long to decode (decipher) the text received at the channel output.
Thus, from the practical point of view, it would be of
considerable interest to investigate the relation between n and
$\lambda$. Feinstein [4] obtained some interesting results in this
direction, but we shall not go into them here. From the purely
practical point of view, we must note again that in both the
Feinstein and the Shannon methods the construction of a code
with the required characteristics is not given; the existence of
such a code is proved, but no indication is given of how to
actually find it.

The result which we have obtained is the second Shannon
theorem, and can be stated as follows.

Theorem.

Under the conditions of the theorem of §15, there exists a
code such that the rate of transmission is as close to $H_0$ as
desired.
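The rate estimate of this section can be made concrete with a
short computation. The sketch below evaluates the two equivalent
forms of the lower bound obtained above and compares the result
with $H_0 - 2\lambda$. All numerical values (entropy, $\lambda$, alphabet
size a, block length n, memory m, number of blocks t) are
hypothetical and serve only to show how the bound behaves.

```python
import math

def rate_lower_bounds(h0, lam, a, n, m, t, r=0):
    """Evaluate the lower bound on information per output symbol.

    first  = (s*H0 - lam*s - n lg a) / ((s + n)(1 + m/n))
    second = (H0 - lam - (n/s) lg a) / ((1 + n/s)(1 + m/n))
    with s = n*t + r; the two expressions are algebraically equal.
    """
    s = n * t + r
    lg_a = math.log2(a)
    first = (s * h0 - lam * s - n * lg_a) / ((s + n) * (1 + m / n))
    second = (h0 - lam - (n / s) * lg_a) / ((1 + n / s) * (1 + m / n))
    return first, second

h0, lam = 0.5, 0.05                    # hypothetical entropy and lambda
first, second = rate_lower_bounds(h0, lam, a=2, n=1000, m=2, t=1000)
target = h0 - 2 * lam                  # the value the rate must exceed
```

With m/n and n/s both small, the bound sits well above
`target`; shrinking n or t in this sketch shows how the overhead
of the m extra letters and of the residual block eats into the
rate.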


Conclusion 


The proof given above, which is based on the ideas of 
Shannon, McMillan, and Feinstein, is certainly long and compli- 
cated. However, it rests entirely on a single central idea which, 
when correctly understood, makes very transparent all the 
different steps of the complicated demonstration. Therefore, 
we consider it expedient to stress again, in somewhat more 
detail, the basic idea which lies behind all our considerations. 

The channel we are given is a noisy channel. This means
that we cannot determine the sequences of symbols sent at the
channel input from the sequences received at the channel output;
because of noise, two different sequences at the channel input
can give rise to the same sequence at the output. The situation
is considerably improved if we know two groups $B_1$ and
$B_2$ of n-term sequences at the channel output such that 1) $B_1$
and $B_2$ do not contain any sequence in common, and 2) with
overwhelming probability, the first of the two sequences which
might be sent is mapped into one of the sequences of the group
$B_1$, and the second into one of the sequences of the group $B_2$.
In this case, by receiving a sequence of group $B_1$ at the channel
output, we can almost be assured that the message was the
first of the two possible sequences, whereas if a sequence of
the group $B_2$ is received at the output, then it is almost certain
that the second sequence was the message. Thus, under the
circumstances which we have described, these two transmitted
sequences are distinguishable. Groups of three, four, and more
distinguishable sequences (as yet we do not touch upon the
question of the existence of such groups) can be determined
at the channel input in exactly the same way. Suppose we
were able to find a group consisting of a large number K of
such distinguishable n-term sequences at the channel input. If
we could limit ourselves to sending only sequences belonging
to this group, then from the sequence received at the channel
output, we would be able to determine almost without error
the sequence transmitted.

Can we do this? Obviously, in order to do so, it is necessary
that the number L of different n-term sequences from the
output of the given source which it is required to transmit
through the channel should not exceed K; for under this
condition (and only under this condition) will it be possible for
us to code the whole group of L such sequences which might
have to be transmitted into our "distinguishable" group of K
sequences at the channel input. Thus, the inequality $L \le K$
serves as a criterion of the possibility of transmitting almost
without error, and our efforts must be directed towards making
L as small as possible and K as large as possible. The first
goal is attained by using McMillan's theorem (Ch. II). By
neglecting the "low probability" group (with total probability
as small as desired) and, consequently, restricting ourselves to
transmitting sequences from the "high probability" group, we
greatly reduce the number of these sequences, and make it
approximately equal to $2^{nH_0}$, where $H_0$ is the entropy of the
given source. The second problem is solved using Feinstein's
fundamental lemma (Ch. IV), which asserts the existence of
distinguishable groups for which the number of terms is
$K > 2^{n(C-\varepsilon)}$, where C is the (ergodic) capacity of the
channel, and $\varepsilon$ is a positive number as small as desired. If
$H_0 < C$, then we see at once that we have $L < K$ for a suitable
choice of the distinguishable groups, and our problem is solved.
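The comparison of L and K comes down to comparing two exponents,
$n(H_0 + \lambda)$ and $n(C - \varepsilon)$. The sketch below does this for an
assumed example that is not in the text: a memoryless binary
source and a memoryless binary symmetric channel, for which the
capacity is taken to be the standard value $1 - h(p)$, where
$h(p) = -p \lg p - (1-p) \lg (1-p)$ (an assumption standing in for the
ergodic capacity of the general theory).

```python
import math

def binary_entropy(p):
    """h(p) = -p lg p - (1-p) lg (1-p), in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def exponents(source_p, flip_p, lam, eps, n):
    """Return (lg of the bound on L, lg of the bound on K).

    L < 2^(n(H0 + lam)) bounds the high probability group,
    K > 2^(n(C - eps)) bounds a distinguishable group.
    """
    h0 = binary_entropy(source_p)          # source entropy H0
    capacity = 1.0 - binary_entropy(flip_p)
    return n * (h0 + lam), n * (capacity - eps)

# Hypothetical parameters: P(letter = 1) = 0.1, flip probability 0.05.
lg_L, lg_K = exponents(source_p=0.1, flip_p=0.05,
                       lam=0.05, eps=0.05, n=100)
coding_possible = lg_L < lg_K
```

In this example $H_0$ is about 0.47 and the capacity about 0.71,
so the gap between the exponents grows linearly with n and L
falls below K with room to spare.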

This is the central idea of the proof. All the rest is merely 
technique, although this technique sometimes requires great 
ingenuity in overcoming difficulties which arise. 

In all the formulations of the Shannon theorems which exist
in the literature, these theorems are accompanied by the converse
proposition: if $H_0 > C$, then coding with the required
effect is impossible. All authors regard this statement as almost
obvious and allot its proof only several lines. In the version of
the proof which is given in this paper, I do not see the possibility
of proving these converse propositions. The whole matter here
is the definition of C, a definition which, in my opinion, is
usually given rather carelessly. It is said that the capacity C
of a given channel is the least upper bound of the rate of
transmission over this channel of the output of all possible
sources (the alphabet of which coincides with the input alphabet
of the channel). As far as I can see, here the words "all
possible sources" must be replaced by the words "all possible
ergodic sources". (Consequently, the capacity which I define is
called "ergodic".) Without this addition, the proofs of both
McMillan and Feinstein break down at their central point. But
if C is understood to be the ergodic capacity of the given
channel (as is done in this paper) then the converse propositions
mentioned above are not only not obvious, but apparently require
essentially new ideas for their proof. I consider it possible
that this difficulty can be overcome by using known limitations
on admissible coding systems.